Abstract

In modern data analysis, one is frequently faced with statistical inference problems involving 
massive datasets. Processing such large datasets is usually viewed as a substantial computational 
challenge. However, if data are a statistician's main resource then access to more data should be 
viewed as an asset rather than as a burden. In this paper we describe a computational framework 
based on convex relaxation to reduce the computational complexity of an inference procedure 
when one has access to increasingly larger datasets. Convex relaxation techniques have been 
widely used in theoretical computer science as they give tractable approximation algorithms to 
many computationally intractable tasks. We demonstrate the efficacy of this methodology in 
statistical estimation in providing concrete time-data tradeoffs in a class of denoising problems. 
Thus, convex relaxation offers a principled approach to exploit the statistical gains from larger 
datasets to reduce the runtime of inference algorithms. 
Introduction

The rapid growth in the size and scope of datasets in science and technology has created a need 
for novel foundational perspectives on data analysis that blend computer science and statistics. 
That classical perspectives from these fields are not adequate to address emerging problems in 
"Big Data" is apparent from their sharply divergent nature at an elementary level — in computer 
science, the growth of the number of data points is a source of "complexity" that must be tamed via 
algorithms or hardware, whereas in statistics, the growth of the number of data points is a source 
of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. In 
classical statistics, where one considers the increase in inferential accuracy as the number of data 
points grows, there is little or no consideration of computational complexity. Indeed, if one imposes 
the additional constraint — prevalent in real-world applications — that a certain level of inferential 
accuracy be achieved within a limited time budget, classical theory provides no guidance as to how 
to design an inferential strategy.¹ In classical computer science, practical solutions to large-scale
problems are often framed in terms of approximations to idealized problems, but even when such 
approximations are sought, they are rarely expressed in terms of the coin of the realm of the theory 
of inference — the statistical risk function. Thus there is little or no consideration of the idea that 
computation can be simplified in large datasets because of the enhanced inferential power in the 
data. In general, in computer science, datasets are not viewed formally as a resource on a par with 
time and space (such that the more of the resource the better). 

On intuitive grounds it is not implausible that strategies can be designed that yield mono- 
tonically improving risk as data accumulate, even in the face of a time budget. In particular, if 
an algorithm simply ignores all future data once a time budget is exhausted, then statistical risk 
will not increase (under various assumptions that may not be desirable in practical applications). 
Alternatively, one might allow linear growth in the time budget (for example, in a real-time set- 
ting), and attempt to achieve such growth via a subsampling strategy where some fraction of the 
data are dropped. Executing such a strategy may be difficult, however, in that the appropriate 
fraction depends on the risk function and thus on a mathematical analysis that may be difficult to 
carry out. Moreover, subsampling is a limited strategy for controlling computational complexity. 
More generally, one would like to consider some notion of "algorithm weakening," where as data 
accumulate one can back off to simpler algorithmic strategies that nonetheless achieve a desired 
risk. The challenge is to do this in a theoretically sound manner. 

We base our approach to this problem on the notion of a "time-data complexity class." In 
particular, we define a class TD(t(p), n(p), e(p)) of parameter estimation problems in which a
p-dimensional parameter underlying an unknown population can be estimated with a risk of e(p)
given n(p) i.i.d. samples using an inference procedure with runtime t(p). Our definition parallels 
the definition of the TISP complexity class in computational complexity theory for describing 
algorithmic tradeoffs between time and space resources [3]. In this formalization, classical results 
in estimation theory can be viewed as emphasizing the tradeoffs between the second and third 
parameters (amount of data and risk). Our focus in this paper is to fix e(p) to some desired level 
of accuracy and to investigate the tradeoffs between the first two parameters, namely runtime and 
dataset size. 

Although classical statistics gave little consideration to computational complexity, computa- 
tional issues have come increasingly to the fore in modern "high-dimensional statistics" [13], where 
the number of parameters p is relatively large and the number of data points n relatively small. In 
this setting, methods based on convex optimization have been emphasized (in particular methods 
based on $\ell_1$ penalties). This is due in part to the favorable analytic properties of convex functions
and convex sets, but also due to the fact that such methods tend to have favorable computational 
scaling. However, the treatment of computation has remained informal, with no attempt to char- 
acterize tradeoffs between computation time and estimation quality. In our work we aim explicitly 
at such tradeoffs, in the setting in which both n and p are large. 

To develop a notion of "algorithm weakening" that combines computational and statistical con- 
siderations, we consider estimation procedures for which we can characterize the computational 
benefits as well as the loss in estimation performance due to the use of weaker algorithms. Reflect- 
ing the fact that the space of all algorithms is poorly understood, we retain the focus on convex 
optimization from high-dimensional statistics, but we consider parameterized hierarchies of
optimization procedures in which a form of algorithm weakening is obtained by employing successively
weaker outer approximations to convex sets. Such convex relaxations have been widely used to
give efficient approximation algorithms for intractable problems in computer science [53]. As we
will discuss, a precise characterization of both the estimation performance and the computational
complexity of employing a particular relaxation of a convex set can be obtained by appealing to
convex geometry and to results on the complexity of solving convex programs. Specifically, the
tighter relaxations in these families offer better approximation quality (and in our context better
estimation performance) but such relaxations are computationally more complex. On the other
hand, the weaker relaxations are computationally more tractable, and they can provide the same
estimation performance as the tighter ones but with access to more data. In this manner, convex
relaxations provide a principled mechanism to weaken inference algorithms in order to reduce the
runtime in processing larger datasets.

1 Note that classical statistics contains a branch known as sequential analysis that does discuss methods that
stop collecting data points after a target error level has been reached (see, e.g., [35]), but this is different from the
computational complexity guarantees (the number of steps that a computational procedure requires) that are our
focus.

To demonstrate explicit tradeoffs in high-dimensional, large-scale inference, we focus for
simplicity and concreteness on estimation in sequence models [31]:

$$y = x^* + \sigma z, \qquad (1)$$

where $\sigma > 0$, the noise vector $z \in \mathbb{R}^p$ is standard normal, and the unknown parameter x* belongs
to a known subset $S \subset \mathbb{R}^p$. The objective is to estimate x* based on n independent observations
$\{y_i\}_{i=1}^n$ of y. This denoising setup has a long history and has been at the center of some remarkable
results in the high-dimensional setting over the past two decades beginning with the papers of 
Donoho and Johnstone [HI [E]. The estimators discussed next proceed by first computing the 
sample mean $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ and then using $\bar{y}$ as input to a suitable convex program. Of course, this
is equivalent to a denoising problem in which the noise variance is $\sigma^2/n$ and we are given just one
sample. The reason we consider the elaborate two-step procedure is to account more accurately 
both for data aggregation and for subsequent processing in our runtime calculations. Indeed, in a 
real-world setting one is typically faced with a massive dataset in unaggregated form, and when 
both p and n may be large, summarizing the data before any further processing can itself be an 
expensive computation. As will be seen in concrete calculations of time-data tradeoffs, the number 
of operations corresponding to data aggregation is sometimes comparable to or even larger than 
the number of operations required for subsequent processing in a massive data setting. 

In order to estimate x*, we consider the following natural shrinkage estimator given by a
projection of the sample mean $\bar{y}$ onto a convex set C that is an outer approximation to S, i.e., $S \subset C$:

$$\hat{x}_n(C) = \arg\min_{x \in \mathbb{R}^p} \frac{1}{2} \|\bar{y} - x\|_{\ell_2}^2 \quad \text{s.t. } x \in C. \qquad (2)$$

We study the estimation performance of a family of shrinkage estimators $\{\hat{x}_n(C_i)\}$ that employ as
the convex constraint one of a sequence of convex outer approximations $\{C_i\}$ with $C_1 \supset C_2 \supset \cdots \supset S$.
Given the same number of samples, using a weaker relaxation such as $C_1$ leads to an estimator with
a larger risk than would result from using a tighter relaxation such as $C_2$. On the other hand, given
access to more data samples the weaker approximations provide the same estimation guarantees 
as the tighter ones. In settings in which computing a weaker approximation is more tractable 
than computing a tighter one, a natural computation/sample tradeoff arises. We characterize this 
tradeoff in a number of stylized examples, motivated by problems such as collaborative filtering, 
learning an ordering of a collection of random variables, and inference in networks. 
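
The two-step procedure behind (2), aggregation followed by projection, can be sketched in Python as follows. This is only an illustrative sketch: the nested sets used here (the unit balls of the $\ell_\infty$, $\ell_2$ and $\ell_1$ norms, which satisfy $B_{\ell_\infty} \supset B_{\ell_2} \supset B_{\ell_1}$) and all function names are assumptions chosen because their Euclidean projections are simple to write down. They are not the relaxations analyzed in this paper.

```python
import math

def sample_mean(samples):
    """Aggregate n observations of dimension p; costs O(n * p) operations."""
    n, p = len(samples), len(samples[0])
    total = [0.0] * p
    for y in samples:
        for j in range(p):
            total[j] += y[j]
    return [t / n for t in total]

def project_linf_ball(v, radius=1.0):
    """Projection onto {x : max_j |x_j| <= radius}: clip each coordinate."""
    return [max(-radius, min(radius, vj)) for vj in v]

def project_l2_ball(v, radius=1.0):
    """Projection onto {x : ||x||_2 <= radius}: rescale if outside."""
    norm = math.sqrt(sum(vj * vj for vj in v))
    if norm <= radius:
        return list(v)
    return [radius * vj / norm for vj in v]

def project_l1_ball(v, radius=1.0):
    """Projection onto {x : sum_j |x_j| <= radius} via sorting and soft-thresholding."""
    if sum(abs(vj) for vj in v) <= radius:
        return list(v)
    magnitudes = sorted((abs(vj) for vj in v), reverse=True)
    running, theta = 0.0, 0.0
    for k, u in enumerate(magnitudes, start=1):
        running += u
        candidate = (running - radius) / k
        if u > candidate:  # the k-th largest coordinate stays nonzero
            theta = candidate
    return [math.copysign(max(abs(vj) - theta, 0.0), vj) for vj in v]

def shrinkage_estimator(samples, project):
    """Estimator of the form (2): project the sample mean onto a convex set."""
    return project(sample_mean(samples))
```

Passing `project_l1_ball` uses the tightest of the three sets, whereas passing `project_linf_ball` uses the weakest one, for which the projection is a coordinate-wise clip.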

More broadly, this paper highlights the role of computation in estimation by jointly studying 
both the computational and the statistical aspects of high-dimensional inference. Such an under- 
standing is particularly of interest in modern inferential tasks in data-rich settings. Further, an 
observation from our examples on time-data tradeoffs is that in many contexts one does not need 
too many extra data samples in order to go from a computationally inefficient estimator based on
a tight relaxation to an extremely efficient estimator based on a weaker relaxation. Consequently, 
in application domains in which obtaining more data is not too expensive it may be preferable to 
acquire more data with the upshot being that the computational infrastructure can be relatively 
less sophisticated. 

We should note that we investigate only one algorithm weakening mechanism, namely convex 
relaxation, and one class of statistical estimation problems, namely denoising in a high-dimensional 
sequence model. There is reason to believe, however, that the principles described in this paper are 
relevant more generally. Convex-optimization-based procedures are employed in a variety of
large-scale data analysis tasks [11], [13], and it is likely to be interesting to explore hierarchies of convex
relaxations in such tasks. In addition, there are a number of potentially interesting mechanisms 
beyond convex relaxation for weakening inference procedures such as dimensionality reduction or 
other forms of data quantization, and approaches based on clustering or coresets. We discuss these 
and other research directions in the Conclusions. 

Related work A number of papers have considered computational and sample complexity trade- 
offs in the setting of learning binary classifiers. Specifically, several authors have described settings 
under which speedups in running time of a classifier learning algorithm are possible given a
substantial increase in dataset size [48J [TBI HZ]. In contrast, in the denoising setup considered in
this paper, several of our examples of time-data tradeoffs demonstrate significant computational 
speedups with just a constant factor increase in dataset size. Another attempt in the binary clas- 
sifier learning setting, building on earlier work on classifier learning in data-rich problems [TO], has 
shown that modest improvements in runtime (of constant factors) may be possible with access to 
more data by employing the stochastic gradient descent method [49]. Time-data tradeoffs have also
been characterized in Boolean network training from time series data [12], but the computational 
speedups offered there are from exponential-time algorithms to slightly faster but still exponential- 
time algorithms. Two recent papers [31 [3l] have considered time-data tradeoffs in sparse principal 
component analysis (PCA) and in biclustering, in which a sparse rank-one matrix is corrupted by 
noise and the objective is to recover the support of the matrix. We also study time-data tradeoffs in 
the estimation of a sparse rank-one matrix, but from a denoising perspective. In our discussion of 
Example 3 below we discuss the differences between our problem setup and these latter two papers. 
As a general contrast to all these previous results, a major contribution of the present paper is the 
demonstration of the efficacy of convex relaxation as a powerful algorithm weakening mechanism 
for processing massive datasets in a broad range of settings. 

Paper outline The main sections of this paper proceed in the following sequence. The next 
section describes a framework for formally stating results on time-data tradeoffs. Then we provide 
some background on convex optimization and relaxations of convex sets. Following this we
investigate in detail the denoising problem (1), and characterize the risk obtained when one employs a
convex programming estimator of the type (2). Subsequently, we give several examples of time-data
tradeoffs in concrete denoising problems. Finally, we conclude with a discussion of directions for 
further research. 

Formally Stating Time-Data Tradeoffs 

In this section we describe a framework to state results on computational and statistical tradeoffs 
in estimation. Our discussion is relevant to general parameter estimation problems and inference 
procedures; one may keep in mind the denoising problem (1) for concreteness. Consider a sequence
of estimation problems indexed by the dimension p of the parameter to be estimated. Fix a risk
function e(p) that specifies the desired error of an estimator. For example, in the denoising problem
(1) the error of an estimator of the form (2) may be specified as the worst case mean squared error
taken over all elements of the set S, i.e., $\sup_{x^* \in S} \mathbb{E}\left[\|x^* - \hat{x}_n(C)\|_{\ell_2}^2\right]$.

Figure 1: The tradeoff plot between the runtime and sample complexity in a typical parameter
estimation problem. Here the risk is assumed to be fixed to some desired level, and the points
in the plot refer to different algorithms that require a certain runtime and a certain number of
samples in order to achieve the desired risk. The vertical and horizontal lines refer to lower bounds
in sample complexity and in runtime, respectively.

One can informally view an estimation algorithm that achieves a risk of e(p) by processing n(p) 
samples with runtime t(p) as a point on a two-dimensional plot such as Figure 1, with one axis 
representing the runtime and the other representing the sample complexity. To be precise the axes 
in the plot index functions (of p) that represent runtime and number of samples, but we do not 
emphasize such formalities and rather use these plots to provide a useful qualitative comparison of 
inference algorithms. In Figure 1, procedure A requires fewer samples than procedure C to achieve 
the same error, but this reduction in sample complexity comes at the expense of a larger runtime. 
Procedure B has both a larger sample complexity and a larger runtime than procedure C, and thus 
it is strictly dominated by procedure C. 
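
The comparison of procedures on such a plot amounts to a dominance check in two resources. The following Python sketch makes this explicit. The numerical coordinates assigned to procedures A, B and C are hypothetical values chosen only to match the qualitative description above (the text does not give numbers for Figure 1), and the names `Procedure`, `dominates` and `pareto_frontier` are illustrative.

```python
from typing import NamedTuple

class Procedure(NamedTuple):
    name: str
    runtime: float  # t(p) at a fixed p, for a fixed target risk e(p)
    samples: float  # n(p) at the same p

def dominates(a, b):
    """a dominates b if a is no worse in both resources and strictly better in one."""
    no_worse = a.runtime <= b.runtime and a.samples <= b.samples
    strictly_better = a.runtime < b.runtime or a.samples < b.samples
    return no_worse and strictly_better

def pareto_frontier(procedures):
    """Procedures that are not dominated by any other procedure in the list."""
    return [q for q in procedures
            if not any(dominates(r, q) for r in procedures)]

# Hypothetical coordinates consistent with the description of Figure 1.
A = Procedure("A", runtime=100.0, samples=10.0)  # few samples, large runtime
B = Procedure("B", runtime=60.0, samples=30.0)   # worse than C in both resources
C = Procedure("C", runtime=40.0, samples=20.0)
frontier = pareto_frontier([A, B, C])
```

Under these assumed values, A and C are incomparable (each is better in one resource), while B is removed from the frontier because C dominates it.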

Given an error function e(p), there is a lower bound on the number of samples n(p)
required to achieve this error using any computational procedure (i.e., no constraints on t(p)); such
information-theoretic or minimax risk lower bounds correspond to "vertical lines" in the plot in
Figure 1. Characterizing these fundamental limits on sample complexity has been a traditional
focus in the estimation theory literature with a fairly complete set of results available in many
settings. One can imagine asking for similar lower bounds on the computational side, corresponding
to "horizontal lines" in the plot in Figure 1: given a desired risk e(p) and access to an unbounded
number of samples, what is a non-trivial lower bound on the runtime t(p) of any inference
algorithm that achieves a risk of e(p)? Such complexity-theoretic lower bounds are significantly harder
to obtain, and they remain a central open problem in computational complexity theory.

This research landscape informs the qualitative nature of the statements on time-data tradeoffs 
we make in this paper. First, we will not attempt to prove combined lower bounds (as is traditionally
done in the characterization of tradeoffs between physical quantities) involving n(p) and t(p)
jointly; this is because obtaining a lower bound just on t(p) remains a substantial challenge. Hence,
our time-data tradeoff results on the use of more efficient algorithms for larger datasets refer to a 
reduction in the upper bounds on runtimes of estimation procedures with increases in dataset size. 
Second, in any setting in which there is a computational cost associated with touching each data
sample and in which the samples are exchangeable, there is a sample threshold beyond which it 
is computationally more efficient to throw away excess data samples than to process them in any 
form. This observation suggests that there is a "floor," as in Figure 1 with procedures E, F, G and 
H, beyond which additional data do not lead to a reduction in runtime. Precisely characterizing 
this sample threshold is in general very hard as it depends on difficult-to-obtain computational 
lower bounds for estimation tasks and also on the particular space of estimation algorithms that 
one may employ. We will comment further on this point when we consider concrete examples of 
time-data tradeoffs. 

In order to formally state our results concerning time-data tradeoffs, we define a resource class 
constrained by runtime and sample complexity as follows. 

Definition 1. Consider a sequence of parameter estimation problems indexed by the dimension 
p of the space of parameters that index an underlying population. This sequence of estimation 
problems belongs to a time-data class TD(t(p), n(p), e(p)) if there exists an inference procedure for
the sequence of problems with runtime upper-bounded by t(p), with the number of i.i.d. samples
processed bounded by n(p), and which achieves a risk bounded by e(p).

We note that our definition of a time-data resource class parallels the time-space resource classes 
considered in complexity theory [3]; in that literature TISP(t(p), s(p)) denotes a class of problems
of input size p that can be solved by some algorithm using t(p) operations and utilizing s(p) units 
of space. 

With this formalism, classical minimax bounds can be stated as follows. Given some function
n(p) for the number of samples, suppose a parameter estimation problem has a minimax risk of
$e_{\mathrm{minimax}}(p)$ (which depends on the function n(p)). If an estimator achieving a risk of $e_{\mathrm{minimax}}(p)$
is computable with runtime t(p), then this estimation problem lies in $\mathrm{TD}(t(p), n(p), e_{\mathrm{minimax}}(p))$.
Thus the emphasis is fundamentally on the relationship between n(p) and $e_{\mathrm{minimax}}(p)$, without
much focus on the computational procedure that achieves the minimax risk bound. Our interest
in this paper is to fix the risk $e(p) = e_{\mathrm{desired}}(p)$ to be equal to some desired level of accuracy, and
to investigate the tradeoffs between t(p) and n(p) so that a parameter estimation problem lies in

$$\mathrm{TD}(t(p), n(p), e_{\mathrm{desired}}(p)).$$

Convex Relaxation 

In this section we describe the particular algorithmic toolbox on which we focus, namely convex 
programs. Convex optimization methods offer a powerful framework for statistical inference due to 
the broad class of estimators that can be effectively modeled as convex programs. Further the theory 
of convex analysis is useful both for characterizing the statistical properties of convex programming 
based estimators as well as for developing methods to compute such estimators efficiently. Most 
importantly from our viewpoint, convex optimization methods provide a principled and general 
framework for algorithm weakening based on relaxations of convex sets. We briefly discuss the key 
ideas from this literature that are relevant to this paper in this section. A central notion to the 
geometric viewpoint adopted in this section is that of a convex cone, which is a convex set that is 
closed under nonnegative linear combinations. 

Representation of Convex Sets 

Convex programs refer to a class of optimization problems in which we seek to minimize a convex 
function over a convex constraint set. For example, linear programming and semidefinite
programming are two prominent subclasses in which linear functions are minimized over constraint
sets given by affine spaces intersecting the nonnegative orthant (in linear programming) and the
positive semidefinite cone (in semidefinite programming). Roughly speaking, convex programs are
tractable to solve computationally if the convex objective function can be computed efficiently, and
if membership in the convex constraint sets can be certified efficiently;² we will informally refer
to this latter operation as computing the convex constraint set. It is then clear that the main
computational bottleneck associated with solving convex programs of the form (2) is the efficiency
of computing the constraint sets.

A central insight from the literature on convex optimization is that the complexity of computing 
a convex set is closely linked to how efficiently the set can be represented. Specifically, if a convex 
set can be expressed as the intersection of a small number of "basic" or "elementary" convex sets, 
each of which is tractable to compute, then the original convex set is also tractable to compute and 
one can in turn optimize over this set efficiently. Examples of "basic" convex sets include affine 
spaces or cones such as the nonnegative orthant and the cone of positive semidefinite matrices. 
Indeed, a canonical method to represent a convex set is to express the set as the intersection of a 
cone and an affine space. In what follows we will consider such conic representations of convex sets
in $\mathbb{R}^p$.

Definition 2. Let $C \subset \mathbb{R}^p$ be a convex set and let $\mathcal{K} \subset \mathbb{R}^p$ be a convex cone. Then C is said to be
$\mathcal{K}$-representable if C can be expressed as follows for $A \in \mathbb{R}^{m \times p}$, $b \in \mathbb{R}^m$:

$$C = \{x \mid x \in \mathcal{K}, \ Ax = b\}. \qquad (3)$$

Such a representation of C is called a $\mathcal{K}$-representation.

Informally, if $\mathcal{K}$ is the nonnegative orthant (or the semidefinite cone) we will refer to the
resulting representations as LP representations (or SDP representations), following commonly used
terminology in the literature. A virtue of conic representations of convex sets based on the orthant
or the semidefinite cone is that these representations lead to a numerical recipe for solving convex
optimization problems of the form (2) via a natural associated barrier penalty [38]. The
computational complexity of these procedures is polynomial in the dimension of the cone and we discuss
runtimes for specific instances in our discussion of concrete examples of time-data tradeoffs.

Example 1. The p-dimensional simplex is an example of an LP representable set:

$$\Delta^p = \{x \mid \mathbf{1}'x = 1, \ x \geq 0\}, \qquad (4)$$

where $\mathbf{1} \in \mathbb{R}^p$ is the all ones vector.

The p-simplex is the set of probability vectors in $\mathbb{R}^p$. The next example is one of an
SDP-representable set that is commonly encountered both in optimization and in statistics.

Example 2. The elliptope, or the set of correlation matrices, in the space of m x m symmetric
matrices is defined as follows:

$$\mathcal{E}_{m \times m} = \{X \mid X \succeq 0, \ X_{ii} = 1 \ \forall i\}. \qquad (5)$$
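
To make the notion of "computing" a constraint set concrete, the following Python sketch gives a separation oracle for the simplex (4) and a membership test for the elliptope (5). It is a minimal illustration under stated assumptions: the function names and tolerances are hypothetical, and the positive semidefiniteness test is a simple elimination-based check that is adequate for small, well-scaled matrices. A practical implementation would more likely rely on an eigenvalue or Cholesky routine from a numerical library.

```python
def simplex_separation_oracle(x, tol=1e-9):
    """Return None if x lies in the simplex (4). Otherwise return (a, b) with
    a'y <= b for every y in the simplex and a'x > b (a separating hyperplane)."""
    p = len(x)
    for i, xi in enumerate(x):
        if xi < -tol:              # violates x_i >= 0
            a = [0.0] * p
            a[i] = -1.0
            return a, 0.0
    total = sum(x)
    if total > 1.0 + tol:          # violates 1'x <= 1
        return [1.0] * p, 1.0
    if total < 1.0 - tol:          # violates 1'x >= 1
        return [-1.0] * p, -1.0
    return None

def is_psd(X, tol=1e-9):
    """Check positive semidefiniteness of a symmetric matrix by elimination."""
    n = len(X)
    A = [[float(v) for v in row] for row in X]
    for k in range(n):
        pivot = A[k][k]
        if pivot < -tol:
            return False
        if pivot <= tol:
            # A zero pivot in a PSD matrix forces the rest of its row to vanish.
            if any(abs(A[k][j]) > tol for j in range(k + 1, n)):
                return False
            continue
        for i in range(k + 1, n):
            factor = A[i][k] / pivot
            for j in range(k + 1, n):
                A[i][j] -= factor * A[k][j]
    return True

def in_elliptope(X, tol=1e-9):
    """Membership in (5): symmetric, unit diagonal, positive semidefinite."""
    n = len(X)
    symmetric = all(abs(X[i][j] - X[j][i]) <= tol
                    for i in range(n) for j in range(i))
    unit_diagonal = all(abs(X[i][i] - 1.0) <= tol for i in range(n))
    return symmetric and unit_diagonal and is_psd(X, tol)
```

The simplex oracle inspects p + 2 linear conditions, which reflects why LP-representable sets with few constraints are inexpensive to compute in the sense used above.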

2 More precisely, one requires an efficient separation oracle that responds YES if the point is in the convex set, 
and otherwise provides a hyperplane that separates the point from the convex set. 
Conic representations are somewhat limited in their modeling capacity, and an important
generalization is obtained by considering lifted representations. In particular the notion of lift-and-project
plays a critical role in many examples of efficient representations of convex sets. The lift-and-project
concept is simple: we wish to express a convex set $C \subset \mathbb{R}^p$ as the projection of a convex set $C' \subset \mathbb{R}^{p'}$
in some higher-dimensional space (i.e., $p' > p$). The complexity of solving the associated convex
programs is now a function of the lifting dimension $p'$. Thus, lift-and-project techniques are useful
if $p'$ is not too much larger than p and if $C'$ has an efficient representation in the higher-dimensional
space $\mathbb{R}^{p'}$. Lift-and-project provides a very powerful representation tool, as seen in the following
example.

Example 3. The cross-polytope is the unit ball of the $\ell_1$-norm:

$$B^p_{\ell_1} = \Big\{x \in \mathbb{R}^p \ \Big|\ \sum_i |x_i| \leq 1\Big\}.$$

The $\ell_1$-norm has been the focus of much attention recently in statistical model selection and feature
selection due to its sparsity-inducing properties. While the cross-polytope has 2p vertices,
a direct specification in terms of linear constraints involves $2^p$ inequalities:

$$B^p_{\ell_1} = \Big\{x \in \mathbb{R}^p \ \Big|\ \sum_i z_i x_i \leq 1, \ \forall z \in \{-1,+1\}^p\Big\}.$$

However we can obtain a tractable representation by lifting to $\mathbb{R}^{2p}$ and then projecting onto the first
p coordinates:

$$B^p_{\ell_1} = \Big\{x \in \mathbb{R}^p \ \Big|\ \exists z \in \mathbb{R}^p \text{ s.t. } -z_i \leq x_i \leq z_i \ \forall i, \ \sum_i z_i \leq 1\Big\}.$$

Note that in $\mathbb{R}^{2p}$ with the additional variables z, we have only 2p + 1 inequalities.
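
The gap between the direct and the lifted descriptions in Example 3 can be checked on small instances. The Python sketch below is illustrative only and its function names are assumptions. It tests membership using all $2^p$ sign inequalities, and separately using the 2p + 1 lifted inequalities with the certificate $z_i = |x_i|$, which is the smallest z compatible with $-z_i \leq x_i \leq z_i$.

```python
from itertools import product

def in_cross_polytope_direct(x, tol=1e-12):
    """Direct description: check sum_i s_i x_i <= 1 for all 2^p sign vectors s."""
    return all(sum(s * xi for s, xi in zip(signs, x)) <= 1.0 + tol
               for signs in product((-1, 1), repeat=len(x)))

def lifted_certificate(x, tol=1e-12):
    """Return z certifying membership via the lifted description, or None.
    Any feasible z has z_i >= |x_i|, so z_i = |x_i| minimizes sum_i z_i."""
    z = [abs(xi) for xi in x]
    return z if sum(z) <= 1.0 + tol else None

def satisfies_lifted(x, z, tol=1e-12):
    """Verify the 2p + 1 inequalities: -z_i <= x_i <= z_i and sum_i z_i <= 1."""
    box = all(-zi - tol <= xi <= zi + tol for xi, zi in zip(x, z))
    return box and sum(z) <= 1.0 + tol

def constraint_counts(p):
    """Number of inequalities in the (direct, lifted) descriptions."""
    return 2 ** p, 2 * p + 1
```

For instance, `constraint_counts(20)` returns `(1048576, 41)`, so the direct check already enumerates over a million inequalities at p = 20 while the lifted check inspects 41.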

Another example of a polytope that requires many inequalities in a direct description is the
permutahedron [M], the convex hull of all the permutations of the vector $[1, \ldots, p]' \in \mathbb{R}^p$. In fact
the permutahedron requires exponentially many linear inequalities in a direct description, while a
lifted representation involves $O(p \log(p))$ additional variables and about $O(p \log(p))$ inequalities in
the higher-dimensional space [25]. We refer the reader to the literature on conic representations for
other examples (see [27] and the references therein), including lifted semidefinite representations.



Hierarchies of Convex Relaxations 

In many cases of interest, convex sets may not have tractable representations. Lifted representations 
in such cases have lifting dimensions that are super-polynomially large in the dimension of the 
original convex set, and thus the associated numerical techniques lead to intractable computational 
procedures that have super-polynomial runtime with respect to the dimension of the original set. 
A prominent example of a convex set that is difficult to compute is the cut polytope:

$$\mathrm{CUT}_{m \times m} = \mathrm{conv}\{mm' \mid m \in \{-1,+1\}^m\}. \qquad (6)$$

Rank-one signed matrices and their convex combinations are of interest in collaborative filtering and 
clustering problems (see the section on time-data tradeoffs). There is no known tractable
representation of the cut polytope; lifted linear or semidefinite representations have lifting dimensions that
are super-polynomial in size. Such computational issues have led to a large literature on
approximating intractable convex sets by tractable ones. For the purposes of this paper, and following the
dominant trend in the literature, we focus on outer approximations. For example, the elliptope (5)
is an outer relaxation of the cut polytope, and it has been employed in approximation algorithms
for intractable combinatorial optimization problems such as finding the maximum-weight cut in a
graph [26]. More generally, one can imagine a hierarchy of increasingly tighter approximations $\{C_i\}$
of a convex set C as follows:

$$C \subset \cdots \subset C_3 \subset C_2 \subset C_1.$$
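
The statement that the elliptope is a strict outer relaxation of the cut polytope can be verified by hand for 3 x 3 matrices, and the Python sketch below carries out that check. It is an illustration under stated assumptions rather than part of the paper's analysis: the helper names are hypothetical, and the argument uses the standard facts that a Gram matrix of unit vectors is positive semidefinite with unit diagonal, and that a linear functional attains its minimum over a polytope at a vertex.

```python
import math
from itertools import product

def outer(m):
    """Rank-one signed matrix mm' for a sign vector m, i.e., a vertex in (6)."""
    return [[mi * mj for mj in m] for mi in m]

def gram(vectors):
    """Gram matrix of a list of vectors; such a matrix is always PSD."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors]
            for u in vectors]

def offdiag_sum(X):
    """Linear functional: sum of the entries X_ij over pairs i < j."""
    n = len(X)
    return sum(X[i][j] for i in range(n) for j in range(i + 1, n))

def min_over_cut_vertices(m):
    """Minimum of offdiag_sum over the vertices of the m x m cut polytope,
    which equals its minimum over the whole polytope."""
    return min(offdiag_sum(outer(s)) for s in product((-1, 1), repeat=m))

# Three unit vectors in the plane at mutual angles of 120 degrees.
unit_vectors = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
                for k in range(3)]
X = gram(unit_vectors)  # unit diagonal and PSD, hence a point of the elliptope
```

Every vertex mm' is itself the Gram matrix of the one-dimensional unit vectors $m_i$, so the cut polytope is contained in the elliptope. On the other hand, `offdiag_sum(X)` is approximately -1.5 while `min_over_cut_vertices(3)` is -1, so X lies in the elliptope but outside the cut polytope.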

There exist several mechanisms for deriving such hierarchies, and we describe three frameworks 
here. 

In the first framework, which was developed by Sherali and Adams [50], the set C is assumed
to be polyhedral and each element of the family $\{C_i\}$ is also polyhedral. Specifically, each $C_i$ is
expressed via a lifted LP representation. Tighter approximations are obtained by resorting to
larger-sized lifts so that the lifting dimension increases with the level i in the hierarchy. The
second framework is similar in spirit to the first one, but now the set C is a convex basic, closed
semialgebraic set³ and the approximations $\{C_i\}$ are given by lifted SDP representations. Again the
lifting dimension increases with the level i in the hierarchy. This method was initially pioneered
by Parrilo [40], [41] and by Lasserre [36], and it was studied in greater detail subsequently by
Gouveia et al. [28]. Both these first and second frameworks are similar in spirit in that tighter
approximations are obtained via lifted representations with successively larger lifting dimensions.
The third framework we mention here is qualitatively different from the first two. Suppose C is a
convex set that has a $\mathcal{K}$-representation; by successively weakening the cone $\mathcal{K}$ itself one obtains
increasingly weaker approximations to C. Specifically, we consider the setting in which the cone
$\mathcal{K}$ is a hyperbolicity cone [43]. Such cones have rich geometric and algebraic structure, and their
boundary is given in terms of the vanishing of hyperbolic polynomials. They include the orthant
and the semidefinite cone as special cases. We do not go into further technical details and formal
definitions of these cones here, and instead refer the interested reader to [43]. The main idea is
that one can obtain a family of relaxations $\{\mathcal{K}_i\}$ to a hyperbolicity cone $\mathcal{K} \subset \mathbb{R}^p$ where each $\mathcal{K}_i$ is
a convex cone (in fact, hyperbolic) and is a subset of $\mathbb{R}^p$:

$$\mathcal{K} \subset \cdots \subset \mathcal{K}_3 \subset \mathcal{K}_2 \subset \mathcal{K}_1. \qquad (7)$$

These outer conic approximations are obtained by taking certain derivatives of the hyperbolic
polynomial used to define the original cone $\mathcal{K}$ (see [43] for more details). One then constructs a
hierarchy of approximations $\{C_i\}$ to C by replacing the cone $\mathcal{K}$ in the representation of C by the
family of conic approximations $\{\mathcal{K}_i\}$. From (3) and (7) it is clear that the $\{C_i\}$ so defined satisfy
$C \subset \cdots \subset C_3 \subset C_2 \subset C_1$.

The important point in these three frameworks is that the family of approximations $\{C_i\}$
obtained in each case is ordered both by approximation quality as well as by computational complexity;
that is, the weaker approximations in the hierarchy are also the ones that are more tractable to
compute. This observation leads to an algorithm weakening mechanism that is useful for processing
larger datasets more coarsely. As demonstrated concretely in the next section, the estimator (2)
based on a weaker approximation to C can provide the same statistical performance as one based 
on a stronger approximation to C provided the former estimator is evaluated with more data. The 
upshot is that the first estimator is more tractable to compute than the second. Thus, we obtain 
a technique for reducing the runtime required to process a larger dataset. 

3 A basic, closed semialgebraic set is the collection of solutions of a system of polynomial equations and polynomial 
inequalities [S]. 

Estimation via Convex Optimization 



In this section we investigate the statistical properties of the estimator (2) for the denoising problem 
(1). The signal set $S$ in (1) differs based on the application of interest. For example, $S$ may be the set 
of sparse vectors in a fixed basis, which could correspond to the problem of denoising sparse vectors 
in wavelet bases [17]. The signal set $S$ may be the set of low-rank matrices, which leads to problems 
of collaborative filtering [52]. Finally, $S$ may be a set of permutation matrices corresponding to 
rankings over a collection of items. Our analysis in this section is general, and is applicable to 
these and other settings (see the section on time-data tradeoffs for concrete examples). In some 
denoising problems, one is interested in noise models other than Gaussian. We comment on the 
performance of the estimator (2) in settings with non-Gaussian noise, although we primarily focus 
on the Gaussian case for simplicity. 

Convex Programming Estimators 

In order to analyze the performance of the estimator (2), we introduce a few concepts from convex 
analysis [44]. Given a closed convex set $C \subseteq \mathbb{R}^p$ and a point $a \in C$ we define the tangent cone at $a$ 
with respect to $C$ as 

$T_C(a) = \mathrm{cone}\{b - a \mid b \in C\}$. (8) 

Here $\mathrm{cone}(\cdot)$ refers to the conic hull of a set obtained by taking nonnegative linear combinations of 
elements of the set. The cone $T_C(a)$ is the set of directions to points in $C$ from the point $a$. The 
polar $\mathcal{K}^* \subseteq \mathbb{R}^p$ of a cone $\mathcal{K} \subseteq \mathbb{R}^p$ is the cone 

$\mathcal{K}^* = \{h \in \mathbb{R}^p \mid \langle h, d \rangle \le 0 \ \ \forall d \in \mathcal{K}\}$. 

The normal cone $N_C(a)$ at $a$ with respect to the convex set $C$ is the polar cone of the tangent cone 
$T_C(a)$: 

$N_C(a) = T_C(a)^*$. (9) 

Thus, the normal cone consists of vectors that form an obtuse angle with every vector in the tangent 
cone $T_C(a)$. Both the tangent and normal cones are convex cones. 

A key quantity that will appear in our error bounds is the following notion of the "complexity" 
or "size" of a tangent cone: 

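Definition 3. The Gaussian squared-complexity of a set $\mathcal{D} \subseteq \mathbb{R}^p$ is defined as: 

$g(\mathcal{D}) = \mathbb{E}\left[\left(\sup_{a \in \mathcal{D}} \langle a, \mathbf{g} \rangle\right)^2\right]$, 

where the expectation is with respect to $\mathbf{g} \sim \mathcal{N}(0, I_{p \times p})$. 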

This quantity is closely related to the Gaussian complexity of a set [201 E] which consists of 
no squaring of the term inside the expectation. The Gaussian squared-complexity shares many 
properties in common with the Gaussian complexity, and we describe those that are relevant to 
this paper in the next subsection. Specifically, we discuss methods to estimate this quantity for 
sets $\mathcal{D}$ that have some structure. 

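With these definitions and letting $B^p_{\ell_2}$ denote the $\ell_2$ ball in $\mathbb{R}^p$, we have the following result on 
the error between $\hat{x}_n(C)$ and $x^*$: 

Proposition 4. For $x^* \in S \subseteq \mathbb{R}^p$ and with $C \subseteq \mathbb{R}^p$ convex such that $S \subseteq C$, we have the error 
bound 

$\mathbb{E}\left[\|x^* - \hat{x}_n(C)\|_{\ell_2}^2\right] \le \frac{\sigma^2}{n}\, g(T_C(x^*) \cap B^p_{\ell_2})$. 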






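Proof: We have that $\bar{y} = x^* + \frac{\sigma}{\sqrt{n}} z$. We begin by establishing a bound that is derived by 
conditioning on $z = \tilde{z}$. Subsequently, taking expectations concludes the proof. We have from the 
optimality conditions [44] of the convex program (2) that 

$x^* + \frac{\sigma}{\sqrt{n}} \tilde{z} - \hat{x}_n(C)|_{z=\tilde{z}} \in N_C(\hat{x}_n(C)|_{z=\tilde{z}})$. 

Here $\hat{x}_n(C)|_{z=\tilde{z}}$ represents the optimal value of (2) conditioned on $z = \tilde{z}$. As $x^* \in S \subseteq C$, we have 
that $x^* - \hat{x}_n(C)|_{z=\tilde{z}} \in T_C(\hat{x}_n(C)|_{z=\tilde{z}})$. Since the normal and tangent cones are polar to each other, 
we have that 

$\left\langle x^* + \frac{\sigma}{\sqrt{n}} \tilde{z} - \hat{x}_n(C)|_{z=\tilde{z}},\ x^* - \hat{x}_n(C)|_{z=\tilde{z}} \right\rangle \le 0$. 

It then follows that 

$\|x^* - \hat{x}_n(C)|_{z=\tilde{z}}\|_{\ell_2}^2 \le \frac{\sigma}{\sqrt{n}} \langle \hat{x}_n(C)|_{z=\tilde{z}} - x^*, \tilde{z} \rangle$ 

$= \frac{\sigma}{\sqrt{n}} \|\hat{x}_n(C)|_{z=\tilde{z}} - x^*\|_{\ell_2} \left\langle \frac{\hat{x}_n(C)|_{z=\tilde{z}} - x^*}{\|\hat{x}_n(C)|_{z=\tilde{z}} - x^*\|_{\ell_2}}, \tilde{z} \right\rangle$ 

$\le \frac{\sigma}{\sqrt{n}} \|\hat{x}_n(C)|_{z=\tilde{z}} - x^*\|_{\ell_2} \sup_{d \in T_C(x^*),\ \|d\|_{\ell_2} \le 1} \langle d, \tilde{z} \rangle$. 

Dividing both sides by $\|\hat{x}_n(C)|_{z=\tilde{z}} - x^*\|_{\ell_2}$, then squaring both sides, and finally taking expectations 
completes the proof. □ 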

Note that the basic structure of the error bound provided by the estimator (2) in fact holds for 
an arbitrary distribution on the noise $z$ with the Gaussian squared-complexity suitably modified. 
However, we focus for the rest of this paper on the Gaussian case, $z \sim \mathcal{N}(0, I_{p \times p})$. 

To summarize in words, the mean squared error is bounded by the noise variance times the 
Gaussian squared-complexity of the normalized tangent cone with respect to $C$ at the true parameter 
$x^*$. Essentially it measures the amount of noise restricted to the tangent cone, which is intuitively 
reasonable as only the noise that moves one away from $x^*$ in a feasible direction in $C$ must contribute 
towards the error. Therefore, if the convex constraint set $C$ is "sharp" at $x^*$ so that the cone $T_C(x^*)$ 
is "narrow," then the error is small. At the other extreme, if the constraint set $C = \mathbb{R}^p$ then the 
error is $\frac{\sigma^2}{n} p$, as one would expect. 
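
As a quick sanity check of the unconstrained extreme, the following sketch simulates the denoising model with $C = \mathbb{R}^p$, in which case the estimator reduces to the sample mean. The helper name `unconstrained_risk` and the values of `p`, `n`, and `sigma` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def unconstrained_risk(p, n, sigma, trials=2000, seed=0):
    """Monte Carlo estimate of E||x* - mean(y)||^2 when C is all of R^p."""
    rng = np.random.default_rng(seed)
    x_star = np.ones(p)  # arbitrary true signal; the risk does not depend on it
    # The mean of n samples x* + sigma*z is distributed as x* + (sigma/sqrt(n)) z.
    noise = (sigma / np.sqrt(n)) * rng.standard_normal((trials, p))
    estimates = x_star + noise
    return float(np.mean(np.sum((estimates - x_star) ** 2, axis=1)))

p, n, sigma = 50, 10, 2.0
empirical = unconstrained_risk(p, n, sigma)
predicted = sigma ** 2 * p / n  # (sigma^2 / n) * p
print(empirical, predicted)
```

The two printed numbers should agree up to Monte Carlo error, which is the behaviour the bound predicts when the tangent cone is the whole space.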

While Proposition 4 is useful and indeed it will suffice for the purposes of demonstrating time-data 
tradeoffs in the next section, there are a couple of shortcomings in the result as stated. 
First, suppose the signal set $S$ is contained in a ball around the origin with the radius of the ball 
being small relative to the noise variance $\sigma^2/n$. In such a setting, the estimator $\hat{x} = 0$ leads to a 
smaller mean squared error than one would obtain from Proposition 4. Second, and somewhat 
more subtly, suppose that one doesn't have a perfect bound on the size of the signal set. For 
concreteness, consider a setting in which $S$ is a set of sparse vectors with bounded $\ell_1$ norm, in 
which case a good choice for the constraint set $C$ in the estimator (2) is an appropriately scaled 
$\ell_1$ ball. However, if we do not know the $\ell_1$ norm of $x^*$ a priori, then we may end up employing a 
constraint set $C$ such that $x^*$ does not belong to $C$ (so $x^*$ is an infeasible solution) or that $x^*$ lies 
strictly in the interior of $C$ (hence, $T_C(x^*)$ is all of $\mathbb{R}^p$). Both these situations are undesirable as 
they limit the applicability of Proposition 4 and provide very loose error bounds. The following 
result addresses these shortcomings by weakening the assumptions of Proposition 4. 

Proposition 5. Let $x^* \in S \subseteq \mathbb{R}^p$ and let $C \subseteq \mathbb{R}^p$. Suppose there exists a point $\tilde{x} \in C$ such that 
$C - \tilde{x} = Q_1 \oplus Q_2$, with $Q_1, Q_2$ lying in orthogonal subspaces of $\mathbb{R}^p$ and $Q_2 \subseteq \alpha B^p_{\ell_2}$ for $\alpha \ge 0$. Then 
we have that 

$\mathbb{E}\left[\|x^* - \hat{x}_n(C)\|_{\ell_2}^2\right] \le 6 \left[\frac{\sigma^2}{n}\, g(\mathrm{cone}(Q_1) \cap B^p_{\ell_2}) + \|x^* - \tilde{x}\|_{\ell_2}^2 + \alpha^2\right]$. 

Here $\mathrm{cone}(Q_1)$ is the conic hull of $Q_1$. 

The proof of this result is presented in the Supplementary Information. A number of remarks 
are in order here. With respect to the first shortcoming in Proposition 4 stated above, if $C$ is 
chosen such that $S \subseteq C$ one can set $\tilde{x} = x^*$, $Q_1 = \{0\}$, and $Q_2 = C - \tilde{x}$ in Proposition 5 and readily 
obtain a bound that scales only with the diameter of the convex constraint set $C$. With regard to 
the second shortcoming in Proposition 4 described above, if a point $\tilde{x} \in C$ near $x^*$ has a narrow 
tangent cone $T_C(\tilde{x})$, then one can provide an error bound with respect to $g(T_C(\tilde{x}))$ with an extra 
additive term that depends on $\|x^* - \tilde{x}\|_{\ell_2}^2$; this is done by setting $Q_1 = C - \tilde{x}$ and $Q_2 = \{0\}$ (thus, 
$\alpha = 0$) in Proposition 5. More generally, Proposition 5 incorporates both these improvements in 
a single error bound with respect to an arbitrary point $\tilde{x} \in C$; thus, one can further optimize the 
error bound over the choice of $\tilde{x} \in C$ (as well as the choice of the decomposition $Q_1$ and $Q_2$). 

Properties and Computation of Gaussian Squared-Complexity 

We record some properties of the Gaussian squared-complexity that are subsequently useful when 
we demonstrate concrete time-data tradeoffs. It is clear that $g(\cdot)$ is monotonic with respect to 
set nesting, i.e., $g(\mathcal{D}_1) \le g(\mathcal{D}_2)$ for sets $\mathcal{D}_1 \subseteq \mathcal{D}_2$. If $\mathcal{D}$ is a subspace then one can check that 
$g(\mathcal{D} \cap B^p_{\ell_2}) = \dim(\mathcal{D})$. In order to estimate squared-complexities of families of cones, one can imagine 
appealing to techniques similar to those used for estimating Gaussian complexities of sets [201 E]. 
Most prominent among these are arguments based on covering number and metric entropy bounds. 
However, these arguments are frequently not sharp and introduce extraneous log-factors in the 
resulting error bounds. 
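
The subspace property can be checked numerically. The sketch below is only an illustration: it assumes the complexity is evaluated on the subspace intersected with the unit $\ell_2$ ball, so that the supremum of the inner product equals the norm of the projection of the Gaussian vector onto the subspace. The function name and the dimensions are hypothetical.

```python
import numpy as np

def squared_complexity_subspace(p, k, trials=5000, seed=0):
    """Monte Carlo estimate of g(D ∩ unit ball) for a random k-dim subspace D of R^p."""
    rng = np.random.default_rng(seed)
    # Orthonormal basis (p x k) of a random k-dimensional subspace.
    basis, _ = np.linalg.qr(rng.standard_normal((p, k)))
    g = rng.standard_normal((trials, p))
    # sup of <d, g> over unit-norm d in the subspace is the norm of the projection of g.
    coords = g @ basis
    return float(np.mean(np.sum(coords ** 2, axis=1)))

estimate = squared_complexity_subspace(p=40, k=7)
print(estimate)  # should be close to the subspace dimension
```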

In a recent paper [15], sharp upper bounds on the Gaussian complexities of normalized cones 
have been established for families of cones of interest in a class of linear inverse problems. The 
(square of the) Gaussian complexity can be upper bounded by the Gaussian squared-complexity 
$g(\mathcal{D})$ via Jensen's inequality: 

$\left(\mathbb{E}\left[\sup_{d \in \mathcal{D}} \langle d, \mathbf{g} \rangle\right]\right)^2 \le g(\mathcal{D})$, 

where $\mathbf{g}$ is a standard normal vector. In fact most of these bounds in [15] were obtained by bounding 
$g(\mathcal{D})$ and thus they are directly relevant to our setting. In the rest of this section we present the 
bounds on $g(\mathcal{D})$ from [15] that will be used in this paper, deferring to that paper for proofs in most 
cases. In some cases the proofs do require modifications with respect to their counterparts in [15], 
and for these cases we give full proofs in the Supplementary Information. 

The first result is a direct consequence of convex duality and provides a fruitful general technique 
to compute sharp estimates of Gaussian squared-complexities. Let $\mathrm{dist}(a, \mathcal{D})$ denote the $\ell_2$ distance 
from a point $a$ to the set $\mathcal{D}$. 

Lemma 1. [15] Let $\mathcal{K} \subseteq \mathbb{R}^p$ be a convex cone and let $\mathcal{K}^* \subseteq \mathbb{R}^p$ be its polar. Then we have for any 
$a \in \mathbb{R}^p$ that 

$\sup_{d \in \mathcal{K} \cap B^p_{\ell_2}} \langle d, a \rangle = \mathrm{dist}(a, \mathcal{K}^*)$. 

Therefore we have the following result as a simple corollary. 

Corollary 6. Let $\mathcal{K} \subseteq \mathbb{R}^p$ be a convex cone and let $\mathcal{K}^* \subseteq \mathbb{R}^p$ be its polar. For $\mathbf{g} \sim \mathcal{N}(0, I_{p \times p})$ we 
have that 

$g(\mathcal{K} \cap B^p_{\ell_2}) = \mathbb{E}\left[\mathrm{dist}(\mathbf{g}, \mathcal{K}^*)^2\right]$. 
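
To make the duality concrete, here is a small numerical sketch for one cone chosen purely for illustration (it is not an example from the paper): the nonnegative orthant, whose polar is the nonpositive orthant. The code compares the supremum over the cone intersected with the unit ball against the distance to the polar, and then estimates the squared-complexity, which for this cone equals $p/2$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 30
a = rng.standard_normal(p)

# K = nonnegative orthant, so its polar K* is the nonpositive orthant.
proj_polar = np.minimum(a, 0.0)             # Euclidean projection of a onto K*
dist_to_polar = np.linalg.norm(a - proj_polar)

# The maximizer of <d, a> over K ∩ unit ball is the normalized positive part of a.
a_plus = np.maximum(a, 0.0)
sup_value = a_plus @ a / np.linalg.norm(a_plus)

# Random feasible points of K ∩ unit ball should never exceed the supremum.
d = np.abs(rng.standard_normal((1000, p)))
d /= np.linalg.norm(d, axis=1, keepdims=True)
best_random = np.max(d @ a)

# Squared-complexity of this cone: E[dist(g, K*)^2] = E||max(g, 0)||^2 = p / 2.
g = rng.standard_normal((20000, p))
g_K = float(np.mean(np.sum(np.maximum(g, 0.0) ** 2, axis=1)))
print(sup_value, dist_to_polar, best_random, g_K)
```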

Based on the duality result of Lemma 1 and Corollary 6, one can compute the following sharp 
bounds on the Gaussian squared-complexities of tangent cones with respect to the $\ell_1$ norm and 
nuclear norm balls. These are especially relevant when one wishes to estimate sparse signals or 
low-rank matrices; in these settings the $\ell_1$ and nuclear norm balls serve as useful constraint sets for 
denoising as the tangent cones with respect to these sets at sparse vectors and at low-rank matrices 
are particularly narrow. Both these results are used when we describe time-data tradeoffs. 

Proposition 7. [15] Let $x \in \mathbb{R}^p$ be a vector containing $s$ nonzero entries. Let $T$ be the tangent 
cone at $x$ with respect to an $\ell_1$ norm ball scaled so that $x$ lies on the boundary of the ball, i.e., a 
scaling of the unit $\ell_1$ norm ball by a factor $\|x\|_{\ell_1}$. Then 

$g(T \cap B^p_{\ell_2}) \le 2s \log\left(\frac{p}{s}\right) + \frac{5}{4} s$. 
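
The following sketch estimates this squared-complexity by Monte Carlo and compares it with a bound of the form $2s\log(p/s) + \frac{5}{4}s$. It relies on the duality between tangent and normal cones: the normal cone of the $\ell_1$ ball at $x$ is generated by the subdifferential of the $\ell_1$ norm, so the squared distance to it is a one-dimensional convex minimization over a scale $t \ge 0$. The ternary search, the function names, and the choices `p = 200`, `s = 5` are assumptions made for illustration.

```python
import numpy as np

def dist_sq_to_scaled_subdiff(g, signs, support, t):
    """Squared distance from g to t times the subdifferential of the l1 norm at x."""
    on = g[support] - t * signs                        # fixed signs on the support
    off = np.maximum(np.abs(g[~support]) - t, 0.0)     # entries in [-t, t] off the support
    return float(np.sum(on ** 2) + np.sum(off ** 2))

def dist_sq_to_normal_cone(g, signs, support, iters=80):
    """Minimize the convex function of t >= 0 by ternary search."""
    lo, hi = 0.0, float(np.max(np.abs(g)))
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        f1 = dist_sq_to_scaled_subdiff(g, signs, support, m1)
        f2 = dist_sq_to_scaled_subdiff(g, signs, support, m2)
        if f1 <= f2:
            hi = m2
        else:
            lo = m1
    return dist_sq_to_scaled_subdiff(g, signs, support, 0.5 * (lo + hi))

def l1_tangent_cone_complexity(p, s, trials=300, seed=0):
    rng = np.random.default_rng(seed)
    support = np.zeros(p, dtype=bool)
    support[:s] = True
    signs = np.ones(s)  # sign pattern of x on its support
    vals = [dist_sq_to_normal_cone(rng.standard_normal(p), signs, support)
            for _ in range(trials)]
    return float(np.mean(vals))

p, s = 200, 5
estimate = l1_tangent_cone_complexity(p, s)
bound = 2 * s * np.log(p / s) + 1.25 * s
print(estimate, bound)
```

The estimate should fall below the bound; the gap gives a rough sense of how much slack the closed-form expression carries at this sparsity level.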

Next we state a result for low-rank matrices and the nuclear norm ball. 

Proposition 8. [19] Let $X \in \mathbb{R}^{m_1 \times m_2}$ be a matrix of rank $r$. Let $T$ be the tangent cone at $X$ with 
respect to a nuclear norm ball scaled so that $X$ lies on the boundary of the ball, i.e., a scaling of 
the unit nuclear norm ball by a factor equal to the nuclear norm of $X$. Then 

$g(T \cap B^{m_1 m_2}_{\ell_2}) \le 3r(m_1 + m_2 - r)$. 

Next we state and prove a result that allows us to estimate Gaussian squared-complexities of 
general cones. The bound is based on the volume of the dual of the cone of interest, and the proof 
involves an appeal to Gaussian isoperimetry [37]. A similar result on Gaussian complexities of cones 
(without the square) was proved in [15], but that result does not directly imply our statement and 
we therefore give a complete self-contained proof in the Supplementary Information. The volume 
of a cone is assumed to be normalized (between zero and one) so we consider the relative fraction 
of a unit Euclidean sphere that is covered by a cone: 

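Proposition 9. Let $\mathcal{K} \subseteq \mathbb{R}^p$ be a cone such that its polar $\mathcal{K}^* \subseteq \mathbb{R}^p$ has a normalized volume of 
$\mu \in \left[\frac{1}{4}\exp\{-p/20\}, \frac{1}{4e^2}\right]$. For $p \ge 12$, we have that 

$g(\mathcal{K} \cap B^p_{\ell_2}) \le 20 \log\left(\frac{1}{4\mu}\right)$. 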

If a cone is narrow then its polar will be wide, leading to a large value of $\mu$ and hence a small 
quantity on the right-hand side of the bound. This result leads to bounds on Gaussian squared-complexity 
in settings in which one can easily obtain estimates of volumes. One setting in which 
such estimates are easily obtained is the case of tangent cones with respect to vertex transitive 
polytopes. We recall that a vertex transitive polytope [33] is one in which there exists a symmetry 
of the polytope for each pair of vertices mapping the two vertices isomorphically to each other. 
Roughly speaking, all the vertices in such polytopes are the same. Some examples include the 
cross-polytope (the $\ell_1$ norm ball), the simplex, the hypercube (the $\ell_\infty$ norm ball), and many 
polytopes generated by the action of groups [46]. We will see many examples of such polytopes in 
our examples on time-data tradeoffs, and thus we will appeal to the following corollary repeatedly: 

Corollary 10. Suppose that $P \subseteq \mathbb{R}^p$ is a vertex transitive polytope with $v$ vertices and let $x$ be a 
vertex of this polytope. If $4e^2 \le v \le 4\exp\{p/20\}$ then 

$g(T_P(x) \cap B^p_{\ell_2}) \le 20 \log(v/4)$. 

Figure 2: (left) A signal set $S$ consisting of $x^*$; (middle) Two convex constraint sets $C$ and $C'$, where 
$C$ is the convex hull of $S$ and $C'$ is a relaxation that is more efficiently computable than $C$; (right) 
The tangent cone $T_C(x^*)$ is contained inside the tangent cone $T_{C'}(x^*)$. Consequently, the Gaussian 
squared-complexity $g(T_C(x^*) \cap B^p_{\ell_2})$ is smaller than the complexity $g(T_{C'}(x^*) \cap B^p_{\ell_2})$, so that the 
estimator $\hat{x}_n(C)$ requires fewer samples than the estimator $\hat{x}_n(C')$ for a risk of at most 1. 



Proof: The normal cones at the vertices of $P$ partition $\mathbb{R}^p$. If the polytope is vertex-transitive, then 
the normal cones are all equivalent to each other (up to orthogonal transformations). Consequently, 
the (normalized) volume of the normal cone at any vertex is $1/v$. Since the normal cone at a vertex 
is polar to the tangent cone, we have the desired result from Proposition 9. □ 
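
Because vertex counts such as $(\sqrt{p})!$ overflow quickly, the bound $20\log(v/4)$ is easiest to evaluate from $\log v$. The helper below is a hypothetical convenience function (the name `corollary10_bound` and the choice `p = 10_000` are assumptions); it returns `None` when the vertex count falls outside the range $4e^2 \le v \le 4\exp\{p/20\}$ required by the corollary.

```python
import math

def corollary10_bound(log_v, p):
    """Return 20*log(v/4) given log(v), or None if v is outside [4e^2, 4*exp(p/20)]."""
    lower = math.log(4.0) + 2.0        # log(4 e^2)
    upper = math.log(4.0) + p / 20.0   # log(4 exp(p/20))
    if not (lower <= log_v <= upper):
        return None
    return 20.0 * (log_v - math.log(4.0))

p = 10_000               # ambient dimension, so sqrt(p) = 100
m = math.isqrt(p)

# Cross-polytope in R^p: 2p vertices.
cross_polytope = corollary10_bound(math.log(2 * p), p)
# A polytope with (sqrt(p))! vertices; lgamma(m + 1) = log(m!).
permutations = corollary10_bound(math.lgamma(m + 1), p)
print(cross_polytope, permutations)
```

With unit noise variance, these values can be read as sufficient sample sizes for a risk of at most 1 when the corresponding polytope is used as the constraint set.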

Time-Data Tradeoffs 
Preliminaries 

We now turn our attention to giving examples of time-data tradeoffs in denoising problems via 
convex relaxation. As described previously, we must set a desired risk in order to realize a time- 
data tradeoff — in the examples in the rest of this section, we will fix the desired risk to be equal 
to 1 independent of the problem dimension p. Thus, these denoising problems lie in the time-data 
complexity classes TD($t(p)$, $n(p)$, 1) for different runtime constraints $t(p)$ and sample budgets $n(p)$. 
The following corollary gives the number of samples required to obtain a mean squared error of 1 
via convex optimization in our denoising setup: 

Corollary 11. For $x^* \in S$ and with $S \subseteq C$, if 

$n \ge \sigma^2 g(T_C(x^*) \cap B^p_{\ell_2})$, 

then $\mathbb{E}\left[\|x^* - \hat{x}_n(C)\|_{\ell_2}^2\right] \le 1$. 

Proof: The result follows by a rearrangement of the terms in the bound in Proposition 4. □ 

This corollary states that if we have access to a dataset with n samples, then we can use any 
convex constraint set C such that the term on the right-hand-side in the corollary is smaller than n. 
Recalling that larger constraint sets $C$ lead to larger tangent cones $T_C$, we observe that if $n$ is large 
one can potentially use very weak (and computationally inexpensive) relaxations and still obtain 
a risk of 1. This observation, combined with the important point that the hierarchies of convex 
relaxations described previously are simultaneously ordered both by approximation quality and by 
computational tractability, allows us to realize a time-data tradeoff by using convex relaxation as 
an algorithm weakening mechanism. See Figure 2 for a simple demonstration. 

We further consider settings with $\sigma^2 = 1$ and in which our signal sets $S \subseteq \mathbb{R}^p$ consist of 
elements that have Euclidean norm on the order of $\sqrt{p}$ (measured from the centroid of $S$). In such 
regimes the James-Stein shrinkage estimator [30] offers about the same level of performance as the 
maximum-likelihood estimator, and both these are outperformed in statistical risk by nonlinear 
estimators of the form (2) based on convex optimization. 

Finally, we briefly remark on the runtimes of our estimators. The runtime for each of the 
procedures below is calculated by adding the number of operations required to compute the sample 
mean $\bar{y}$ and the number of operations to solve (2) to some accuracy. Hence if the number of 
samples used is $n$ and if $f_C(p)$ denotes the number of operations required to project $\bar{y}$ onto $C$, then 
the total runtime is $np + f_C(p)$. Thus the number of samples enters the runtime calculations as just 
an additive term. As we process larger datasets, the first term in this calculation becomes larger 
but this increase is offset by a more substantial decrease in the second term due to the use of a 
computationally tractable convex relaxation. We note that such a runtime calculation extends to 
more general inference problems in which one employs estimators of the form (2) but with different 
loss functions in the objective; specifically, the runtime is calculated as above so long as the loss 
function depends only on some sufficient statistic computed from the data. If the loss function is 
instead of the form $\sum_{i=1}^n \ell(x; y_i)$ and it cannot be summarized via a sufficient statistic of the data 
$\{y_i\}_{i=1}^n$, then the number of samples enters the runtime computation in a multiplicative manner in 
the number of operations $f_C(p)$ required to compute the convex programming estimator. 

Example 1: Denoising Signed Matrices 

We consider the problem of recovering signed matrices corrupted by noise: 

$S = \{a a' \mid a \in \{-1, +1\}^{\sqrt{p}}\}$. 

We have $a \in \mathbb{R}^{\sqrt{p}}$ so that $S \subseteq \mathbb{R}^p$. Inferring such signals is of interest in collaborative filtering where 
one wishes to approximate matrices as the sum of a small number of rank-one signed matrices [52]. 
Such matrices may represent, for example, the movie preferences of users as in the Netflix problem. 

The tightest convex constraint that one could employ in this case is $C = \mathrm{conv}(S)$, which is 
the cut polytope (6). In order to obtain a risk of 1 with this constraint, one requires $n = c_1 \sqrt{p}$ 
by applying Corollary 10 and Corollary 11 based on the symmetry of the cut polytope. The cut 
polytope is in general intractable to compute. Hence the best known algorithms to project onto 
$C$ would require runtime that is super-polynomial in $p$. Consequently, the total runtime of this 
algorithm is $c_1 p^{1.5} + \text{super-poly}(p)$. 

A commonly used tractable relaxation of the cut polytope is the elliptope (5). By computing 
the Gaussian squared-complexity of the tangent cones at rank-one signed matrices with respect to 
this set, it is possible to show that $n = c_2 \sqrt{p}$ leads to a risk of 1 (with $c_2 > c_1$). Further, interior-point 
based convex optimization algorithms for solving (2) that exploit the special structure of 
the elliptope require $O(p^{2.25})$ operations^4 [11], [7], [29]. Hence the total runtime of this procedure is 
$c_2 p^{1.5} + O(p^{2.25})$. 

Finally, an even weaker relaxation of the cut polytope than the elliptope is the unit ball of the 
nuclear norm scaled by a factor of $\sqrt{p}$; one can verify that the elements of $S$ lie on the boundary 
of this set, and are in fact extreme points. Appealing to Proposition 8 (using the fact that the 
elements of $S$ are rank-one matrices) and Corollary 11, we conclude that $n = c_3 \sqrt{p}$ samples provide 
a mean-squared error of 1 (with $c_3 > c_2$). Projecting onto the scaled nuclear norm ball can be done 

4 The exponent is a result of the manner in which we define our signal set so that a rank-one signed matrix lives 
in $\mathbb{R}^p$. 

by computing a singular value decomposition (SVD), and then truncating the sequence of singular 
values in descending order when their cumulative sum exceeds $\sqrt{p}$ (in effect projecting the vector 
of singular values onto an $\ell_1$ ball of size $\sqrt{p}$). This operation requires $O(p^{1.5})$ operations, and thus 
the total runtime is $c_3 p^{1.5} + O(p^{1.5})$. 
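
A minimal sketch of such a projection is given below. It is not the authors' implementation: it assumes the standard sort-based Euclidean projection of the singular values onto an $\ell_1$ ball (a soft-thresholding rule), and the function names and the test matrix are hypothetical. The matrix is $\sqrt{p} \times \sqrt{p}$ with $\sqrt{p} = 20$, and the ball radius is $\sqrt{p}$, which matches the nuclear norm of a rank-one signed matrix.

```python
import numpy as np

def project_l1_ball_nonneg(v, radius):
    """Euclidean projection of a nonnegative vector onto {w >= 0, sum(w) <= radius}."""
    if v.sum() <= radius:
        return v.copy()
    u = np.sort(v)[::-1]                       # sort in descending order
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - radius) / ks > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)    # soft-threshold level
    return np.maximum(v - theta, 0.0)

def project_nuclear_ball(Y, radius):
    """Project Y onto the nuclear norm ball by projecting its singular values."""
    U, svals, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * project_l1_ball_nonneg(svals, radius)) @ Vt

rng = np.random.default_rng(0)
m = 20                                   # sqrt(p), so p = 400
a = rng.choice([-1.0, 1.0], size=m)
X = np.outer(a, a)                       # rank-one signed matrix, nuclear norm m
Y = X + rng.standard_normal((m, m))      # noisy observation
X_hat = project_nuclear_ball(Y, radius=m)
print(np.linalg.norm(X_hat, "nuc"), np.linalg.norm(X_hat - X), np.linalg.norm(Y - X))
```

The dominant cost is the SVD of a $\sqrt{p} \times \sqrt{p}$ matrix, which is cubic in $\sqrt{p}$ and so consistent with the $O(p^{1.5})$ operation count quoted in the text.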

To summarize, the cut-matrix denoising problem lives in the time-data class TD(super-poly($p$), 
$c_1 \sqrt{p}$, 1), in TD($O(p^{2.25})$, $c_2 \sqrt{p}$, 1), and in TD($O(p^{1.5})$, $c_3 \sqrt{p}$, 1), with constants $c_1 < c_2 < c_3$. 
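
The runtime accounting behind these classes can be tabulated directly. In the sketch below the constants `c2` and `c3` are hypothetical placeholders (the text only states their ordering), big-O terms are replaced by their leading powers of $p$, and the helper name `total_runtime` is an assumption.

```python
def total_runtime(n, p, projection_cost):
    """Operations to form the sample mean (n * p) plus the cost of one projection."""
    return n * p + projection_cost

p = 10_000
sqrt_p = p ** 0.5
c2, c3 = 1.0, 4.0  # hypothetical constants with c3 > c2

# Elliptope: fewer samples, but an O(p^2.25) interior-point solve.
elliptope = total_runtime(c2 * sqrt_p, p, p ** 2.25)
# Nuclear norm ball: more samples, but an O(p^1.5) SVD-based projection.
nuclear = total_runtime(c3 * sqrt_p, p, p ** 1.5)
print(elliptope, nuclear)
```

Under these assumed constants the weaker relaxation wins by a wide margin, because the extra samples add only $O(p^{1.5})$ operations while the projection cost drops from $p^{2.25}$ to $p^{1.5}$.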

Example 2: Ordering Variables 

In many data analysis tasks, one is given a collection of variables that are suitably ordered so 
that the population covariance is banded. Under such a constraint, thresholding the entries of 
the empirical covariance matrix based on their distance from the diagonal has been shown to be 
a powerful method for estimation in the high-dimensional setting [8]. However, if an ordering 
of the variables is not known a priori, then one must jointly learn an ordering for the variables 
and estimate their underlying covariance. As a stylized version of this variable ordering problem, 
let $M \in \mathbb{R}^{\sqrt{p} \times \sqrt{p}}$ be a known tridiagonal matrix (with Euclidean norm $O(\sqrt{p})$) and consider the 
following signal set: 

$S = \{\Pi M \Pi' \mid \Pi \text{ is a } \sqrt{p} \times \sqrt{p} \text{ permutation matrix}\}$. 

The matrix $M$ here is to be viewed as a covariance matrix. Thus, the corresponding denoising 
problem (1) is that we wish to estimate a covariance matrix in the absence of knowledge of the 
ordering of the underlying variables. In a real-world scenario one might wish to consider covariance 
matrices M that belong to some class of banded matrices and then construct S as done here, but 
we stick with the case of a fixed M for simplicity. Further, the noise in a practical setting is 
better modeled as coming from a Wishart distribution — again, we focus on the Gaussian case for 
simplicity. 

The tightest convex constraint set that one could employ in this case is the convex hull of $S$, 
which is in general intractable to compute for arbitrary matrices $M$. For example, if one were 
able to compute this set in polynomial time for any tridiagonal matrix $M$, one would be able to 
solve the intractable longest path problem [24] (finding the longest path between any two vertices 
in a graph) in polynomial time. With this convex constraint set, we find using Corollary 10 and 
Corollary 11 that $n = c_1 \sqrt{p} \log(p)$ samples would lead to a risk of 1. This follows from the fact 
that $\mathrm{conv}(S)$ is a vertex-transitive polytope with about $(\sqrt{p})!$ vertices. Thus, the total runtime is 
$c_1 p^{1.5} \log(p) + \text{super-poly}(p)$. 

An efficiently computable relaxation of $\mathrm{conv}(S)$ is a scaled $\ell_1$ ball (scaled by the $\ell_1$ norm of 
$M$). Appealing to Proposition 7 on tangent cones with respect to the $\ell_1$ ball and to Corollary 11, 
we find that $n = c_2 \sqrt{p} \log(p)$ samples suffice to provide a risk of 1. In applying Proposition 7 we 
note that $M$ is assumed to be tridiagonal and therefore has $O(\sqrt{p})$ nonzero entries. The runtime 
of this procedure is $c_2 p^{1.5} \log(p) + O(p \log(p))$. 
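
To see where the $\sqrt{p}\log(p)$ scaling comes from, one can substitute the sparsity of $M$ into a bound of the form $2s\log(p/s) + \frac{5}{4}s$ for the $\ell_1$ tangent cone. The worked step below is a sketch under stated assumptions: it takes $s = 3\sqrt{p}$ as a convenient proxy for the number of nonzero entries of a tridiagonal $\sqrt{p} \times \sqrt{p}$ matrix (the exact count is at most $3\sqrt{p} - 2$) and uses $\sigma^2 = 1$.

```latex
\begin{align*}
g(T \cap B^p_{\ell_2})
  &\le 2s \log\!\left(\frac{p}{s}\right) + \frac{5}{4} s
     && \text{with } s = 3\sqrt{p} \\
  &= 6\sqrt{p}\,\log\!\left(\frac{\sqrt{p}}{3}\right) + \frac{15}{4}\sqrt{p} \\
  &= 3\sqrt{p}\,\log(p) - \left(6\log 3 - \frac{15}{4}\right)\sqrt{p} \\
  &\le 3\sqrt{p}\,\log(p)
     && \text{since } 6\log 3 \approx 6.59 > 3.75 .
\end{align*}
```

Under these assumptions, a sample size on the order of $3\sqrt{p}\log(p)$ is enough for a risk of at most 1, which matches the stated form $c_2\sqrt{p}\log(p)$.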

Thus the variable ordering denoising problem belongs to TD(super-poly($p$), $c_1 \sqrt{p} \log(p)$, 1) and 
to TD($O(p^{1.5} \log(p))$, $c_2 \sqrt{p} \log(p)$, 1), with constants $c_1 < c_2$. 

Example 3: Sparse PCA and Network Activity Identification 

As our third example, we consider sparse PCA in which one wishes to learn from samples a sparse 
eigenvector that contains most of the energy of a covariance matrix. As a simplified version of this 
problem, one can imagine a matrix $M \in \mathbb{R}^{\sqrt{p} \times \sqrt{p}}$ with entries equal to $\sqrt{p}/k$ in the top-left $k \times k$ 
block and zeros elsewhere (so that the Euclidean norm of $M$ is $\sqrt{p}$), and with $S$ defined as: 

$S = \{\Pi M \Pi' \mid \Pi \text{ is a } \sqrt{p} \times \sqrt{p} \text{ permutation matrix}\}$. 

In addition to sparse PCA, such signal sets are also of interest in identifying activity in noisy 
networks [33], as well as in related combinatorial optimization problems such as the planted clique 
problem [24]. In the sparse PCA context, Amini and Wainwright [3] study time-data tradeoffs by 
investigating the sample complexities of two procedures, a simple one based on thresholding and a 
more sophisticated one based on semidefinite programming. Kolar et al. [33] investigate the sample 
complexities of a number of procedures ranging from a combinatorial search method to thresholding 
and sparse SVD. We note that the time-data tradeoffs studied in these two papers [3], [33] relate to 
the problem of learning the support of the leading sparse eigenvector; in contrast, in our setup the 
objective is to simply denoise an element of $S$. Further, while the Gaussian noise setting is of interest 
in some of these domains, in a more realistic sparse PCA problem (such as the one considered in 
[3]) the noise is Wishart rather than Gaussian as considered here. Nevertheless, we stick with our 
stylized problem setting as it provides some useful insights on time-data tradeoffs. Finally, the size 
of the block $k \in \{1, \ldots, \sqrt{p}\}$ depends on the application of interest and it is typically far from the 
extremes 1 and $\sqrt{p}$; we will consider the case $k \sim p^{1/4}$ for concreteness^5. 

As usual, the tightest convex constraint set one can employ in this setting is the convex hull of 
$S$, which is in general intractable to compute; an efficient characterization of this polytope would 
lead to an efficient solution of the intractable planted-clique problem (finding a fully connected 
subgraph inside a larger graph). Using this convex constraint set gives an estimator that requires 
about $n = O(p^{1/4} \log(p))$ samples in order to produce a risk-1 estimate. We obtain this threshold 
by appealing to Corollary 10 and to Corollary 11, and the observation that $\mathrm{conv}(S)$ is a vertex-transitive 
polytope with about $\binom{\sqrt{p}}{p^{1/4}}$ vertices. Thus, the overall runtime is $O(p^{5/4} \log(p)) + \text{super-poly}(p)$. 

A convex relaxation of $\mathrm{conv}(S)$ is the nuclear norm ball scaled by a factor of $\sqrt{p}$ so that the 
elements of $S$ lie on the boundary. From Proposition 8 (observing that the elements of $S$ are 
rank-one matrices) and Corollary 11, we have that $n = c\sqrt{p}$ samples give a risk-1 estimate with 
this procedure. As computed in the example with cut matrices, the overall runtime of this nuclear 
norm procedure is $c p^{1.5} + O(p^{1.5})$. 

In conclusion the denoising version of sparse PCA lies in TD(super-poly($p$), $O(p^{1/4} \log(p))$, 1) 
and in TD($O(p^{1.5})$, $O(\sqrt{p})$, 1). 

Example 4: Estimating Matchings 

As our final example, we consider signals that represent the set of all perfect matchings in the 
complete graph. A matching is any subset of edges of a graph such that no node of the graph 
is incident to more than one edge in the subset, and a perfect matching is a subset of edges in 
which every node is incident to exactly one edge in the subset. Graph matchings arise in a range 
of inference problems such as in chemical structure analysis [35] and in network monitoring [51]. 
Letting $M$ be the adjacency matrix of some perfect matching in the complete graph on $\sqrt{p}$ nodes, 
our signal set in this case is defined as follows: 

$S = p^{1/4} \{\Pi M \Pi' \mid \Pi \text{ is a } \sqrt{p} \times \sqrt{p} \text{ permutation matrix}\}$. 

5 This setting is an interesting threshold case in the planted clique context [Tl 1221 12] where $k = p^{1/4}$ is the 
square-root of the number of nodes $\sqrt{p}$ of the graph represented by $M$ (viewed as an adjacency matrix). 

The scaling of $p^{1/4}$ ensures that the elements of $S$ have Euclidean norm of $\sqrt{p}$. Note that $S \subseteq \mathbb{R}^p$. 
The number of elements in $S$ is $(\sqrt{p} - 1)!!$ when $\sqrt{p}$ is an even number (this number is obtained 
by computing the product of all the odd integers up to $\sqrt{p}$). 
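
The count of perfect matchings is easy to check for small graphs. The sketch below compares the product of odd integers with the equivalent closed form $m! / (2^{m/2} (m/2)!)$, where $m$ stands for the number of nodes $\sqrt{p}$; both function names are hypothetical.

```python
import math

def count_perfect_matchings(m):
    """Number of perfect matchings of the complete graph on m nodes (m even)."""
    if m % 2 != 0:
        raise ValueError("a perfect matching needs an even number of nodes")
    count = 1
    for odd in range(1, m, 2):  # odd integers 1, 3, ..., m - 1
        count *= odd
    return count

def count_via_factorials(m):
    """Closed form m! / (2^(m/2) * (m/2)!) for the same quantity."""
    half = m // 2
    return math.factorial(m) // (2 ** half * math.factorial(half))

for m in (4, 6, 10):
    print(m, count_perfect_matchings(m), count_via_factorials(m))
```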

The tightest convex relaxation in this case is the convex hull of $S$. Unlike the previous three 
cases, projecting onto this convex set is in fact a polynomial-time operation^6 with runtime about 
$O(p^5)$. Appealing to Corollary 10, to Corollary 11, and to the fact that $\mathrm{conv}(S)$ is a vertex-transitive 
polytope, we have that $n = c_1 \sqrt{p} \log(p)$ samples provide a risk-1 estimate. Hence the 
overall runtime is $c_1 p^{1.5} \log(p) + O(p^5)$. 

A tractable relaxation of the perfect matching polytope is a hypersimplex [54], obtained by 
taking the convex hull of all $\sqrt{p} \times \sqrt{p}$ matrices consisting of $\sqrt{p}$ ones and the other entries being 
equal to zero. We scale this hypersimplex by a factor of $p^{1/4}$ so that the elements of $S$ are on 
the boundary. The hypersimplex is also a vertex-transitive polytope like the perfect matching 
polytope, but with about $\binom{p}{\sqrt{p}}$ vertices. Hence from Corollary 10 and from Corollary 11 we have that 
$n = c_2 \sqrt{p} \log(p)$ samples will provide a risk-1 estimate. Further, projecting onto the hypersimplex 
is a very efficient operation based on sorting and has a runtime of $O(p \log(p))$. Consequently the 
total runtime of this procedure is $c_2 p^{1.5} \log(p) + O(p \log(p))$. 

In summary, the matching estimation problem is a member of TD($O(p^5)$, $c_1 \sqrt{p} \log(p)$, 1) and of 
TD($O(p^{1.5} \log(p))$, $c_2 \sqrt{p} \log(p)$, 1) with constants $c_1 < c_2$. 

Some Observations 

A curious observation that we may take away from these examples is that it is possible to obtain 
substantial speedups computationally with just a constant factor increase in the size of the dataset. 
This suggests that in settings in which obtaining additional data is inexpensive, it may be more 
economical to procure more data and employ a more basic computational infrastructure rather 
than to process limited data using powerful and expensive computers. 

Our second observation is relevant to all the examples above but we highlight it in the context 
of denoising cut matrices. In that setting one can use an even weaker relaxation of the cut polytope 
than the nuclear norm ball, such as the Euclidean ball (suitably scaled). While projection onto this 
set is extremely efficient (requiring $O(p)$ operations as opposed to $O(p^{1.5})$ operations for projecting 
onto the nuclear norm ball), the number of samples required to achieve a risk of 1 with this 
approach is $O(p)$; computing the sample mean with so many samples requires $O(p^2)$ operations, 
which leads to an overall runtime that is greater than the runtime $O(p^{1.5})$ for the nuclear norm 
approach. This point highlights an important tradeoff: if our choice of algorithms is between 
nuclear norm projection and Euclidean projection, and if we are in fact given access to $O(p)$ 
data samples, it makes sense computationally to retain only $O(\sqrt{p})$ samples for the nuclear norm 
procedure and throw away the remaining data. This provides a concrete illustration of several key 
issues. Aggregating massive datasets can frequently be very expensive computationally (relative 
to the other subsequent processing), and the number of operations required for this step must be 
taken into account^7. Consequently, in some cases it may make sense to throw away some data 
if preprocessing the full massive dataset is time-consuming. Hence, one may not be able to avail 

6 Edmonds' blossom algorithm [21] for computing maximum-weight matchings in polynomial time leads to a sep- 
aration oracle for this perfect matching polytope. Subsequently, Padberg and Rao 39; developed a faster separation 
oracle for the perfect matching polytope. These separation oracles in turn lead to polynomial-time projection algo- 
rithms via the ellipsoid method [7] . 

7 Note that the aggregation step is more time-consuming than the subsequent projection step in the £i-ball pro- 
jection procedure for ordering variables and in the hypersimplex projection method for denoising matchings. 
oneself of weaker post-aggregation algorithms if these methods require such a large amount of data
to achieve a desired risk that the aggregation step is expensive. This point goes back to the "floor"
in Figure 1 in which one imagines a cutoff in the number of samples beyond which more data are
not helpful in reducing computational runtime. Such a threshold, of course, depends on the space of
algorithms one employs, and in the cut polytope context with the particular algorithms considered
here the threshold occurs at $O(\sqrt{p})$ samples.
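
The operation counts in the discussion above can be tabulated with a small back-of-the-envelope
script. This is only a sketch under stated assumptions: all hidden constants are set to 1, the
sample counts are taken to be exactly $\sqrt{p}$ and $p$, and the per-projection costs (`p ** 1.5`
for the nuclear norm ball, `p` for the Euclidean projection) are illustrative stand-ins rather than
figures from the analysis. The function names are hypothetical.

```python
import math


def aggregation_ops(num_samples: int, dim: int) -> int:
    """Operations needed to average num_samples vectors of length dim."""
    return num_samples * dim


def nuclear_norm_runtime(p: int) -> float:
    """Keep about sqrt(p) samples, then project onto the nuclear norm ball.

    The p ** 1.5 projection cost is an assumed unit-constant stand-in for O(p^1.5).
    """
    n = math.isqrt(p)
    return aggregation_ops(n, p) + p ** 1.5


def euclidean_runtime(p: int) -> float:
    """Use about p samples, then a cheap Euclidean projection (assumed cost p)."""
    return aggregation_ops(p, p) + p


for p in (10**2, 10**4, 10**6):
    print(p, nuclear_norm_runtime(p), euclidean_runtime(p))
```

Under these assumptions the aggregation term dominates the Euclidean approach, which is why
discarding all but roughly $\sqrt{p}$ samples can be the cheaper plan.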

Conclusions 

In this paper we considered the problem of reducing the computational complexity of an inference 
task as one has access to larger datasets. The traditional goal in the theory of statistical inference 
is to understand the tradeoff in an estimation problem between the amount of data available and 
the risk attainable via some class of procedures. In an age of plentiful data in many settings and 
computational resources being the principal bottleneck, we believe that an increasingly important 
objective is to investigate the tradeoffs between computational and sample complexities. As one 
pursues this line of thinking, it becomes clear that a central theme must be the ability to weaken an 
inference procedure as one has access to larger datasets. Accordingly, we proposed convex relaxation 
as an algorithm weakening mechanism, and we investigated its efficacy in a class of denoising tasks. 
Our results suggest that such methods are especially effective in achieving time-data tradeoffs in 
high-dimensional parameter estimation. 

We close our discussion by outlining some exciting future research directions. As algorithm 
weakening is central to the viewpoint described in this paper, it should come as no surprise that 
several of the directions listed below involve interaction with important themes in computer science. 

Computation with streaming data In many massive data problems, one is presented with 
a stream of input data rather than a large fixed dataset, and an estimate may be desired after a 
fixed amount of time independent of the rate of the input stream. In such a setting an alternative 
viewpoint to the one presented in this paper might be more appropriate. Specifically, rather than 
keeping the risk fixed, one would keep the runtime fixed and trade off the risk with the rate of 
the input stream. One can imagine algorithm weakening mechanisms, dependent on the rate of 
the data stream, in which the initial data points are processed using sophisticated algorithms and 
subsequent samples are processed more coarsely. Understanding the tradeoffs in such a setting is 
of interest in a range of applications. 

Alternative algorithm weakening mechanisms The notion of weakening an inference algorithm
is key to realizing a time-data tradeoff. While convex relaxation methods provide a powerful
and general approach, a number of other weakening mechanisms are potentially relevant. For
example, processing data more coarsely by quantization, dimension reduction, and clustering may
be natural in some contexts. Coresets, which originated in the computational geometry community,
summarize a large set of points via a small collection (see, for example, [23] and the references
therein), and they could also provide a powerful algorithm weakening mechanism. Finally, we would
like to mention a computer hardware concept that has implications for massive data analysis. A 
recent approach to designing computer chips is premised on the idea that many tasks do not require 
extremely accurate computation [6]. If one is willing to tolerate small, random errors in arithmetic 
computations (e.g., addition, multiplication), it may be possible to design chips that consume less 
power and are faster than traditional, more accurate chips. Translated to a data analysis context, 
such design principles may provide a hardware-based algorithm weakening mechanism. 
Measuring quality of approximation of convex sets In the mathematical optimization and
theoretical computer science communities, relaxations of convex sets have provided a powerful
toolbox for designing approximation algorithms for intractable problems, most notably those arising
in combinatorial optimization. The manner in which the quality of a relaxation translates to the
quality of an approximation algorithm is usually quantified based on the integrality gap between
the original convex set and its approximation [53]. However, the quantity of interest in a
statistical inference context in characterizing the quality of approximations is based on ratios of
Gaussian squared-complexities of tangent cones. These two quantifications can be radically different;
indeed, several of the relaxations presented in our time-data tradeoff examples that are useful in an
inferential setting would provide poor performance in a combinatorial optimization context. More
broadly, those examples demonstrate that weak relaxations frequently provide as good estimation
performance as tighter ones with just an increase of a constant factor in the number of data samples.
This observation suggests a potentially deeper result along the following lines: many computationally
intractable convex sets for which there exist no tight efficiently-computable approximations as
measured by integrality gap can nonetheless be well-approximated by computationally tractable
convex sets, if the quality of approximation is measured based on statistical inference objectives.
Supplementary Information

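Proof of Proposition 5

As with the proof of Proposition [U we condition on $z = \tilde{z}$. Setting $\delta = x - \tilde{x}$ and setting
$\hat{\delta}_n(\mathcal{C}) = \hat{x}_n(\mathcal{C})|_{z=\tilde{z}} - \tilde{x}$, we can rewrite the problem (2) as follows:

$$
\hat{\delta}_n(\mathcal{C}) = \arg\min_{\delta} \; \frac{1}{2} \left\| (x^* - \tilde{x}) + \frac{1}{\sqrt{n}} \tilde{z} - \delta \right\|_{\ell_2}^2
\quad \text{s.t.} \quad \delta \in \mathcal{C} - \tilde{x}.
$$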



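Letting $R_1$ and $R_2$ denote orthogonal subspaces that contain $Q_1$ and $Q_2$, i.e., $Q_1 \subseteq R_1$ and $Q_2 \subseteq R_2$,
and letting $\delta^{(1)} = \mathcal{P}_{R_1}(\delta)$, $\delta^{(2)} = \mathcal{P}_{R_2}(\delta)$, $\hat{\delta}_n^{(1)}(\mathcal{C}) = \mathcal{P}_{R_1}(\hat{\delta}_n(\mathcal{C}))$, $\hat{\delta}_n^{(2)}(\mathcal{C}) = \mathcal{P}_{R_2}(\hat{\delta}_n(\mathcal{C}))$ denote the
projections of $\delta, \hat{\delta}_n(\mathcal{C})$ onto $R_1, R_2$, we can rewrite the above reformulated optimization problem
as:

$$
\begin{aligned}
\big[\hat{\delta}_n^{(1)}(\mathcal{C}), \hat{\delta}_n^{(2)}(\mathcal{C})\big] = \arg\min_{\delta^{(1)}, \delta^{(2)}} \;
& \frac{1}{2} \left\| \mathcal{P}_{R_1}\!\left[ (x^* - \tilde{x}) + \frac{1}{\sqrt{n}} \tilde{z} \right] - \delta^{(1)} \right\|_{\ell_2}^2
+ \frac{1}{2} \left\| \mathcal{P}_{R_2}\!\left[ (x^* - \tilde{x}) + \frac{1}{\sqrt{n}} \tilde{z} \right] - \delta^{(2)} \right\|_{\ell_2}^2 \\
\text{s.t.} \;\; & \delta^{(1)} \in Q_1, \;\; \delta^{(2)} \in Q_2.
\end{aligned}
$$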

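As the sets $Q_1, Q_2$ live in orthogonal subspaces, the two variables $\delta^{(1)}, \delta^{(2)}$ in this problem can be
optimized separately. Consequently, we have that $\|\hat{\delta}_n^{(2)}(\mathcal{C})\|_{\ell_2} \leq \alpha$ and that

$$
\|\hat{\delta}_n^{(1)}(\mathcal{C})\|_{\ell_2} \leq \sup_{\delta \in \mathrm{cone}(Q_1) \cap B^p_{\ell_2}} \left\langle \delta, \frac{1}{\sqrt{n}} \tilde{z} + (x^* - \tilde{x}) \right\rangle.
$$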



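This bound can be established following the same sequence of steps as in the proof of Proposition [U.
Combining the two bounds on $\hat{\delta}_n^{(1)}(\mathcal{C})$ and $\hat{\delta}_n^{(2)}(\mathcal{C})$, one can then check that

$$
\|\hat{\delta}_n^{(1)}(\mathcal{C})\|_{\ell_2}^2 + \|\hat{\delta}_n^{(2)}(\mathcal{C})\|_{\ell_2}^2
\leq 2 \left[ \frac{1}{n} \Big( \sup_{\delta \in \mathrm{cone}(Q_1) \cap B^p_{\ell_2}} \langle \delta, \tilde{z} \rangle \Big)^2 + \|x^* - \tilde{x}\|_{\ell_2}^2 \right] + \alpha^2.
$$

To obtain a bound on $\|\hat{x}_n(\mathcal{C})|_{z=\tilde{z}} - x^*\|_{\ell_2}^2$ we note that

$$
\begin{aligned}
\|\hat{x}_n(\mathcal{C})|_{z=\tilde{z}} - x^*\|_{\ell_2}^2
&\leq 2 \left[ \|\hat{x}_n(\mathcal{C})|_{z=\tilde{z}} - \tilde{x}\|_{\ell_2}^2 + \|x^* - \tilde{x}\|_{\ell_2}^2 \right] \\
&\leq 2 \|\hat{\delta}_n^{(1)}(\mathcal{C})\|_{\ell_2}^2 + 2 \|\hat{\delta}_n^{(2)}(\mathcal{C})\|_{\ell_2}^2 + 2 \|x^* - \tilde{x}\|_{\ell_2}^2.
\end{aligned}
$$

Taking expectations concludes the proof. □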



Proof of Proposition 9

The main steps of this proof follow the steps of a similar result in [15], with the principal difference
being that we wish to bound Gaussian squared-complexity rather than Gaussian complexity. A
central theme in this proof is the appeal to Gaussian isoperimetry. Let $\mathbb{S}^{p-1}$ denote the sphere in $p$
dimensions. Then in bounding the expected squared-distance to the dual cone $\mathcal{K}^*$ with $\mathcal{K}^* \cap \mathbb{S}^{p-1}$
having a volume of $\mu$, we need only consider the extremal case of a spherical cap in $\mathbb{S}^{p-1}$ having
a volume of $\mu$. The manner in which this is made precise will become clear in the proof. Before
proceeding with the main proof, we state and derive a result on the solid angle subtended by a
spherical cap in $\mathbb{S}^{p-1}$ to which we will need to appeal repeatedly:
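Lemma 2. Let $\psi(\mu)$ denote the solid angle subtended by a spherical cap in $\mathbb{S}^{p-1}$ with volume
$\mu \in \left( \frac{1}{4} \exp\{-p/20\}, \frac{1}{4e^2} \right)$. Then

$$
\psi(\mu) \geq \frac{\pi}{2} \left( 1 - \sqrt{\frac{2 \log\left(\frac{1}{4\mu}\right)}{p-1}} \right).
$$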



Proof of Lemma 2 Consider the following definition of a spherical cap, parametrized by height
$h$:

$$
J = \{ a \in \mathbb{S}^{p-1} \mid a_1 \geq h \}.
$$

Here $a_1$ denotes the first coordinate of $a \in \mathbb{R}^p$. Given a spherical cap of height $h \in [0, 1]$, the solid
angle $\psi$ is given by:

$$
\psi = \frac{\pi}{2} - \sin^{-1}(h). \qquad (10)
$$



We can thus obtain bounds on the solid angle of a spherical cap via bounds on its height. The 
following result from [12] relates the volume of a spherical cap to its height: 



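Lemma 3. [12] For $\frac{2}{\sqrt{p}} \leq h \leq 1$ the volume $\mu(p, h)$ of a spherical cap of height $h$ in $\mathbb{S}^{p-1}$ is bounded
as

$$
\frac{1}{10 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}} \leq \mu(p, h) \leq \frac{1}{2 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}}.
$$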



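Continuing with the proof of Lemma 2, note that for $\frac{2}{\sqrt{p}} \leq h \leq 1$

$$
\frac{1}{2 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}} \leq \frac{1}{4} (1 - h^2)^{\frac{p-1}{2}} \leq \frac{1}{4} \exp\left( -\frac{(p-1) h^2}{2} \right).
$$

Choosing $h = \sqrt{\frac{2 \log(1/(4\mu))}{p-1}}$, we have $\frac{2}{\sqrt{p}} \leq h \leq 1$ based on the assumption $\mu \in \left( \frac{1}{4} \exp\{-p/20\}, \frac{1}{4e^2} \right)$.
Consequently, we can apply Lemma 3 with this value of $h$, combined with the preceding chain of
inequalities, to conclude that

$$
\mu\left( p, \sqrt{\frac{2 \log\left(\frac{1}{4\mu}\right)}{p-1}} \right) \leq \mu.
$$

Hence the solid angle $\psi\left( \mu\left( p, \sqrt{\frac{2 \log(1/(4\mu))}{p-1}} \right) \right)$ is no larger than the solid angle $\psi(\mu)$. Consequently, we
use (10) to conclude that

$$
\psi(\mu) \geq \frac{\pi}{2} - \sin^{-1}\left( \sqrt{\frac{2 \log\left(\frac{1}{4\mu}\right)}{p-1}} \right).
$$

Using the bound $\sin^{-1}(h) \leq \frac{\pi}{2} h$ we obtain the desired bound. □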
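
As a sanity check on Lemma 2, the bound can be compared against a direct numerical computation of
the solid angle. The sketch below is not part of the proof. It assumes the cap-volume density
$\sin^{p-1}(v)$ used later in this supplement, approximates the integrals with a midpoint rule, and
inverts the volume by bisection; the function names and the test values of $(p, \mu)$ are
illustrative choices that lie inside the range required by the lemma.

```python
import math


def sin_power_integral(upper: float, p: int, steps: int = 4000) -> float:
    """Midpoint-rule approximation of the integral of sin(v)**(p-1) over [0, upper]."""
    width = upper / steps
    return width * sum(math.sin((k + 0.5) * width) ** (p - 1) for k in range(steps))


def solid_angle(mu: float, p: int, tol: float = 1e-8) -> float:
    """Solid angle psi in [0, pi/2] of a cap with normalized volume mu (bisection)."""
    omega = sin_power_integral(math.pi, p)  # normalization constant
    lo, hi = 0.0, math.pi / 2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sin_power_integral(mid, p) / omega < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lemma2_lower_bound(mu: float, p: int) -> float:
    """Right-hand side of Lemma 2."""
    return (math.pi / 2) * (1 - math.sqrt(2 * math.log(1 / (4 * mu)) / (p - 1)))


def in_lemma2_range(mu: float, p: int) -> bool:
    """Check mu in (exp(-p/20)/4, 1/(4 e^2)), the range assumed by the lemma."""
    return math.exp(-p / 20) / 4 < mu < 1 / (4 * math.e ** 2)


for p, mu in [(100, 0.01), (100, 0.03), (200, 0.001)]:
    print(p, mu, solid_angle(mu, p), lemma2_lower_bound(mu, p))
```

Note that the admissible interval for $\mu$ is nonempty only when $p > 40$, which is why the
illustrative dimensions are chosen to be fairly large.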

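Proof of Proposition 9 We bound the Gaussian squared-complexity of $\mathcal{K}$ by bounding the
expected squared-distance to the polar cone $\mathcal{K}^*$. Let $\mu(U; t)$ for $U \subseteq \mathbb{S}^{p-1}$ and $t > 0$ denote
the volume of the set of points in $\mathbb{S}^{p-1}$ that are within a Euclidean distance of at most $t$ from
$U$ (recall that the volume of this set is equivalent to the measure of the set with respect to the
normalized Haar measure on $\mathbb{S}^{p-1}$). We have the following sequence of relations by appealing to
the independence of the direction $g/\|g\|_{\ell_2}$ and of the length $\|g\|_{\ell_2}$ of a standard normal vector $g$:

$$
\begin{aligned}
\mathbb{E}[\mathrm{dist}(g, \mathcal{K}^*)^2]
&= \mathbb{E}\big[ \|g\|_{\ell_2}^2 \, \mathrm{dist}(g/\|g\|_{\ell_2}, \mathcal{K}^*)^2 \big] \\
&= p \, \mathbb{E}\big[ \mathrm{dist}(g/\|g\|_{\ell_2}, \mathcal{K}^*)^2 \big] \\
&\leq p \, \mathbb{E}\big[ \mathrm{dist}(g/\|g\|_{\ell_2}, \mathcal{K}^* \cap \mathbb{S}^{p-1})^2 \big] \\
&= p \int_0^\infty \mathbb{P}\big[ \mathrm{dist}(g/\|g\|_{\ell_2}, \mathcal{K}^* \cap \mathbb{S}^{p-1})^2 > t \big] \, dt \\
&= p \int_0^\infty \mathbb{P}\big[ \mathrm{dist}(g/\|g\|_{\ell_2}, \mathcal{K}^* \cap \mathbb{S}^{p-1}) > \sqrt{t} \big] \, dt \\
&= 2p \int_0^\infty s \, \mathbb{P}\big[ \mathrm{dist}(g/\|g\|_{\ell_2}, \mathcal{K}^* \cap \mathbb{S}^{p-1}) > s \big] \, ds \\
&= 2p \int_0^\infty s \, \big[ 1 - \mu(\mathcal{K}^* \cap \mathbb{S}^{p-1}; s) \big] \, ds.
\end{aligned}
$$

Here the third equality follows based on the integral version of the expected value. Let $V \subseteq \mathbb{S}^{p-1}$
denote a spherical cap with the same volume $\mu$ as $\mathcal{K}^* \cap \mathbb{S}^{p-1}$. Then we have by spherical isoperimetry
that $\mu(\mathcal{K}^* \cap \mathbb{S}^{p-1}; s) \geq \mu(V; s)$ for all $s \geq 0$ [37]. Thus

$$
\mathbb{E}[\mathrm{dist}(g, \mathcal{K}^*)^2] \leq 2p \int_0^\infty s \, [1 - \mu(V; s)] \, ds. \qquad (11)
$$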

From here onward, we focus exclusively on bounding the integral. 
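
For readers who want the intermediate steps behind (11) spelled out, the chain of relations above
rests on two standard facts, restated here in generic notation. The symbol $X$ is introduced only
for this remark and stands for any nonnegative random variable; in the proof it plays the role of
the distance from $g/\|g\|_{\ell_2}$ to $\mathcal{K}^* \cap \mathbb{S}^{p-1}$.

```latex
% Tail-integral formula for the second moment, followed by the substitution t = s^2.
\mathbb{E}[X^2]
  = \int_0^\infty \mathbb{P}[X^2 > t] \, dt
  = \int_0^\infty \mathbb{P}[X > \sqrt{t}] \, dt
  = 2 \int_0^\infty s \, \mathbb{P}[X > s] \, ds
  \qquad (t = s^2, \; dt = 2s \, ds),
% Expected squared length of a standard normal vector in R^p.
\qquad
\mathbb{E}\big[ \|g\|_{\ell_2}^2 \big] = \sum_{i=1}^{p} \mathbb{E}[g_i^2] = p .
```

Because $g/\|g\|_{\ell_2}$ is uniformly distributed on the sphere, the tail probability
$\mathbb{P}[X > s]$ is one minus the normalized volume of the set of points within distance $s$ of
$\mathcal{K}^* \cap \mathbb{S}^{p-1}$, which is how the neighborhood volume enters the last line of the chain.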

Let $\tau(\psi)$ denote the volume of a spherical cap subtending a solid angle of $\psi$ radians. Recall
that $\psi$ is a quantity between $0$ and $\pi$. As in Lemma 2 let $\psi(\mu)$ denote the solid angle subtended by a
spherical cap with volume $\mu$. Since the Euclidean distance between points on a sphere is
always smaller than the geodesic distance, we have that $\mu(V; s) \geq \tau(\psi(\mu) + s)$. Further, we have
the following explicit formula for $\tau(\psi)$ [32]:

$$
\tau(\psi) = \omega_p^{-1} \int_0^{\psi} \sin^{p-1}(v) \, dv,
$$

where $\omega_p = \int_0^{\pi} \sin^{p-1}(v) \, dv$ is the normalization constant. Combining these latter two observations,
we can bound the integral in (11) as:

$$
\begin{aligned}
\int_0^\infty s \, [1 - \mu(V; s)] \, ds
&\leq \int_0^\infty s \, [1 - \tau(\psi(\mu) + s)] \, ds \\
&= \int_0^{\pi - \psi(\mu)} s \, [1 - \tau(\psi(\mu) + s)] \, ds \\
&= \frac{(\pi - \psi(\mu))^2}{2} - \int_0^{\pi - \psi(\mu)} s \, \tau(\psi(\mu) + s) \, ds \\
&= \frac{(\pi - \psi(\mu))^2}{2} - \omega_p^{-1} \int_0^{\pi - \psi(\mu)} \int_0^{\psi(\mu) + s} s \, \sin^{p-1}(v) \, dv \, ds.
\end{aligned}
$$
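Next we change the order of integration to obtain:

$$
\begin{aligned}
\int_0^\infty s \, [1 - \mu(V; s)] \, ds
&\leq \frac{(\pi - \psi(\mu))^2}{2} - \omega_p^{-1} \int_0^{\pi} \int_{\max\{v - \psi(\mu), 0\}}^{\pi - \psi(\mu)} \sin^{p-1}(v) \, s \, ds \, dv \\
&= \frac{(\pi - \psi(\mu))^2}{2} - \frac{\omega_p^{-1}}{2} \int_0^{\pi} \left[ (\pi - \psi(\mu))^2 - (\max\{v - \psi(\mu), 0\})^2 \right] \sin^{p-1}(v) \, dv \\
&= \frac{\omega_p^{-1}}{2} \int_0^{\pi} (\max\{v - \psi(\mu), 0\})^2 \sin^{p-1}(v) \, dv \\
&= \frac{\omega_p^{-1}}{2} \int_{\psi(\mu)}^{\pi} (v - \psi(\mu))^2 \sin^{p-1}(v) \, dv.
\end{aligned}
$$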

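We now appeal to the inequalities $\omega_p^{-1} \leq \sqrt{p-1}/2$ and $\sin(x) \leq \exp(-(x - \frac{\pi}{2})^2/2)$ for $x \in [0, \pi]$ to
obtain

$$
\int_0^\infty s \, [1 - \mu(V; s)] \, ds \leq \frac{\sqrt{p-1}}{4} \int_{\psi(\mu)}^{\pi} (v - \psi(\mu))^2 \exp\left( -\frac{(p-1)\left(v - \frac{\pi}{2}\right)^2}{2} \right) dv.
$$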
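Performing a change of variables with $a = \sqrt{p-1} \, (v - \frac{\pi}{2})$, we have

$$
\begin{aligned}
\int_0^\infty s \, [1 - \mu(V; s)] \, ds
&\leq \frac{1}{4} \int_{\sqrt{p-1}\,(\psi(\mu) - \frac{\pi}{2})}^{\sqrt{p-1}\,\frac{\pi}{2}} \left( \frac{a}{\sqrt{p-1}} + \frac{\pi}{2} - \psi(\mu) \right)^2 \exp\left( -\frac{a^2}{2} \right) da \\
&\leq \frac{1}{4} \left[ \frac{\sqrt{2\pi}}{p-1} + \frac{2 \left( \frac{\pi}{2} - \psi(\mu) \right)}{\sqrt{p-1}} + \sqrt{2\pi} \left( \frac{\pi}{2} - \psi(\mu) \right)^2 \right].
\end{aligned}
$$

Here the inequality was obtained by suitably changing the limits of integration after expanding the
square. We now employ Lemma 2 to obtain the final bound:

$$
g(\mathcal{K} \cap B^p_{\ell_2}) \leq \frac{p}{p-1} \sqrt{\frac{\pi}{2}} \left[ 1 + \frac{\pi^2}{2} \log\left( \frac{1}{4\mu} \right) + \sqrt{\pi} \sqrt{\log\left( \frac{1}{4\mu} \right)} \right] \leq 20 \log\left( \frac{1}{4\mu} \right).
$$

Here the final bound holds because $\mu < 1/(4e^2)$ and $p \geq 12$. □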
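
The two elementary inequalities invoked in the proof above, namely $\omega_p^{-1} \leq \sqrt{p-1}/2$ and
$\sin(x) \leq \exp(-(x - \frac{\pi}{2})^2/2)$ on $[0, \pi]$, can be spot-checked numerically. The sketch below is
only a finite check, not a proof. It assumes the standard closed form
$\omega_p = \sqrt{\pi}\,\Gamma(p/2)/\Gamma((p+1)/2)$ for the integral of $\sin^{p-1}$ over $[0, \pi]$, and the
function names, grid sizes, and tolerance are illustrative choices.

```python
import math


def omega(p: int) -> float:
    """omega_p = integral of sin(v)**(p-1) over [0, pi], via the log-gamma closed form."""
    return math.sqrt(math.pi) * math.exp(math.lgamma(p / 2) - math.lgamma((p + 1) / 2))


def check_normalization_bound(max_p: int = 2000, tol: float = 1e-12) -> bool:
    """Verify 1/omega_p <= sqrt(p-1)/2 for p = 2, ..., max_p (up to rounding)."""
    return all(1 / omega(p) <= math.sqrt(p - 1) / 2 + tol for p in range(2, max_p + 1))


def check_sine_bound(grid: int = 10000, tol: float = 1e-12) -> bool:
    """Verify sin(x) <= exp(-(x - pi/2)**2 / 2) on a uniform grid over [0, pi]."""
    xs = (math.pi * k / grid for k in range(grid + 1))
    return all(math.sin(x) <= math.exp(-((x - math.pi / 2) ** 2) / 2) + tol for x in xs)


print(check_normalization_bound(), check_sine_bound())
```

The tolerance guards against floating-point rounding at the points where the inequalities are tight
($p = 2$ for the first and $x = \pi/2$ for the second).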